Using Google's Data and AI Technologies with Kaggle (Cloud Next '19)
Hi, everyone. I'm Anthony Goldbloom, the CEO of Kaggle. It was actually at this event two years ago that we announced that Kaggle was joining Google, so Google Next is a bit sentimental for us. Just a quick show of hands: how many of you have heard of Kaggle before? Well, I guess that's why you're here. How many of you have a Kaggle account? Nice, this is good. How many of you have either submitted to a competition, written a kernel, or downloaded a dataset? OK, cool. The percentage got a little bit low towards the end, but hopefully this is so exciting that by the end of this talk you're all dying to do all three of those things.
We run as our own brand and our own team within Google. You can think of us as a very large machine learning community. We have 2.8 million members, and we were 840,000 when we joined Google two years ago, so our growth is sort of a reflection of how machine learning has grown and evolved. One of the things we think a lot about is what we give our community. We're a platform-neutral community, so people can use whatever they like. But one of the really nice things about partnering with Google is that Google is, in my opinion, the world's best data analytics and machine learning company, and Google Cloud is really moving towards being the best cloud for data analytics and machine learning. So one of the things that we've been starting to experiment with a little bit is: what are the ways that we can give the Kaggle community exposure to Google and Google Cloud tools? We're just at the beginning of this journey. A lot of our first couple of years at Google have been focused on aligning with Google infrastructure, security, and privacy policies. You
know, as a small start-up you have a different stance on things like GDPR than a company like Google. Now that a lot of that work is behind us, we're thinking a lot about what things Google has that would be very exciting to our community. Rachael is going to give you a bit of a sense of some of the early things that we have done. And then, as Rachael said, there's a Dory, so we're very happy to take questions about the kinds of things we're thinking about and what sort of things might be coming. So I'm going to hand it over to Rachael.
Thank you. And if you haven't used any of the Dories before, it should be in the app under this session: there's a little Q&A button, and if you tap that you can ask questions. We'll try to save a nice big chunk of time at the end, because I know Kagglers are the best and you guys always have lots of questions, and I want to make sure we have time to address those.
So I think most of you know what Kaggle is, based on the show of hands. The thing that I first and foremost think about when I think about Kaggle is the community. It's really a technical professional community where people who are entering the space, or have been working in it for a while, can make friends, hang out, hone their skills, learn things, and share resources. Eventually we want to be able to support you in doing all of your data science work on Kaggle.
What's on Kaggle? People. Lots of people. I know I mention the community a lot, but I think Kagglers are the best. We've got over 2.7 million users at this point, and I think we have around 300,000 people who log in every month to write kernels, download datasets, or enter competitions. We also have 15,000 public datasets. These are datasets that we've
uploaded or, mostly, that users have uploaded and want to share: things they've collected, research datasets. We'll talk a little bit about some examples of those. We also have a really large body of scripts and notebooks, so this is code for doing different tasks. You can of course work privately, keep your data private, keep your code private, but people really are very generous with their work and share it, and want to walk people through how to do various machine learning tasks. So if you're trying to do something kind of common, let's say sentiment analysis or object detection, probably somebody's already written a kernel. Then you can go give it a little upvote (you want to make sure you upvote useful kernels) and then fork it, and you don't have to start from scratch. You can use this pipeline that somebody else has already built, and it's a huge time saver. I am
always using other people's little kernels to get my work started.
We're probably best known as a competitions platform. That's sort of how we got our start, and we absolutely still do that, but these days Kaggle is a lot more than competitions, and I want to highlight some of the things that we've done recently, some of the changes.
The Kaggle product that I personally probably use the most is Kernels. What are kernels? I mean, the word means like 18 different things in data science and computer science. Kaggle Kernels are a hosted, in-browser coding environment. We support Jupyter notebooks; that's by far the most popular. We support Python 3 and R kernels; we do not support Python 2. We also have scripts, which are flat files, and there we support Python, R, and also R Markdown. So if there are any R Markdown fans out here: yeah, I'm a big fan of R Markdown. You can write and run R Markdown on Kaggle, and then you have a nice link that you can share with people. Or of course download it and use blogdown or bookdown or slidedown or whatever else you're using in the ecosystem.
Once you've created your kernel, we also have a really nice way to share it: you make it public, and you can post about it on the forums. If you haven't been on the site in a while, we now have a personalized news feed, so we've algorithmically picked some kernels that we think might be exciting for you, or competition announcements we think you might like, so you can always see what's new and happening and what people are talking about.
There's a progression system for kernels and datasets as well as for competitions. Upvoting other people's datasets and kernels (sorry, upvoting other people's forum posts and kernels) helps them progress in the progression system and lets them say, the community values my work. That's just a nice thing to learn, that
others value you or your work, or to let people know that you appreciate what they do.
Here is an example kernel, which is a little bit small, that I wrote. This is an R kernel; I like R, and I also use Python. It shows you how to read in a WAV file and create a spectrogram. If you're not familiar, spectrograms are a time-frequency-intensity transformation of audio files; it's a way to visualize sounds, basically. This one shows you the sounds of ground parrots, which are an Australian parrot that lives on the ground. A bioacoustics researcher had just uploaded a dataset of calls that they'd recorded from these parrots, and I was like, I've got an acoustics background, I think that's kind of interesting. So I just whipped up a little visualization to show what they looked like. If somebody else wanted to do spectrograms (we do have a sound-based competition going on right now), you could copy this kernel, or fork this kernel, to read in a different dataset: remove the dataset that I'd used and just reapply this code to a new dataset.
I mentioned collaboration and discussions. Our forums are super duper active. It's a really great place to ask technical questions to a community of other data scientists. Of course there are lots of places to do that: there's Stack Overflow, there's Stats Exchange. But particularly around machine learning, deep learning, and machine learning engineering, the Kaggle community is really great and supportive.
We also have events. Some of those are in person, and some of those are digital. An example of an in-person event is Kaggle Days. This is a picture
of everybody who went to Kaggle Days Paris, which was a little while ago. We're actually having a Kaggle Days right now in conjunction with Next, which I'm afraid is full, so if you didn't sign up ahead of time, catch us next year. It's been really great. We've gotten talks by Grandmasters, people have run workshops, and we're doing a little on-site competition that people are working on right now, on manufacturing defect detection. It's been a really fantastic way to meet people face to face, because you spend so much time chatting with them online, or maybe you just see them going past in the forums and you think, oh, I don't know if I could ever approach them. So if you are interested in coming and hanging out with Kagglers, keep an eye out for Kaggle Days.
We also have online events. We know that we have a lot of users in different parts of the world for whom it may not be reasonable to get to, say, Paris or San Francisco, so we really want to make sure that we're reaching everybody and being accessible. Even if you are in the US and you have small children and can't travel, we want to make sure that you can still grow professionally as a data scientist. One of the things that we're doing next week, actually, and you can still sign up for this, is CareerCon. This is a completely digital event to help you land your first data science job. I'm doing an educational event as part of this to help people who've never made an API make an API from scratch. We're going to talk about OpenAPI (you may know it as Swagger), we're going to build a little Flask app, we're going to serve it, and we're going to go through the whole process, so you can have something to show on your portfolio that says: hey, sure,
I'm a data scientist and I'm really good at machine learning, but I also know a little bit about what software engineers do, I can play well with them as my new teammates, and you should hire me, and I'm great.
Data. I was actually originally hired to be on the datasets team. The Kaggle team used to individually go and find interesting datasets under open licenses and upload them to Kaggle. We don't need to do that anymore, because our community has been so, so fantastic about sharing data: research data, or data from people who have questions that they would like to get in front of machine learning researchers and who have collected the data for them. If you have data that you would like to share, I would encourage you to put it on Kaggle, assuming that we can support your specific needs. For example, we cannot support HIPAA compliance, so please don't give us HIPAA data yet.
Here are some example datasets that the team actually made automatically. These are open datasets that we think might be useful in your day-to-day work. There's one in here that's San Francisco restaurant scores, like their health scores. So if you're trying to decide where to go for dinner and you want to do a quick little data analysis to make sure you pick the right place, that might be something you can work on.
But we also have some really fantastic user-uploaded datasets. You might be familiar with KMNIST. It was uploaded by a couple of our users, including Anokas, who's our youngest Grandmaster currently. It is a drop-in replacement for MNIST, which is a digit recognition dataset. Is anyone not familiar with MNIST? Yes, a few. OK, so it's a dataset for recognizing handwritten digits, 0 through 9. This is the same thing but for Japanese classical
texts. This team put together this dataset, they released it on Kaggle, and they actually presented a paper about it at NeurIPS, which is what NIPS is called now. I'm not saying putting datasets on Kaggle will get you NeurIPS papers, but it's a really good place to find interesting research datasets specifically tailored towards the machine learning community, which is fantastic.
We also had, relatively recently, a farmer who wanted to train a drone to spritz some... what's it called? Not pesticide, but for plants. Fertilizer? No, the other one. Herbicide, thank you. Herbicide on a specific weed that he had. So he'd gone through his field and taken a bunch of pictures of the specific weed he wanted to target, and uploaded them so that people could try their hand at a really fun, practical application. So there's lots to discover in datasets. I love just going through there and seeing what people have uploaded.
And finally, Kaggle is a great place to challenge yourself. Of course we have competitions. They are, as you might imagine, competitive. We've got many going on right now: we've got images, text, and a sound one, so whatever it is you're interested in, as long as it's supervised, we'll probably have something that might tickle your fancy. We also have a lot of support for people who are learning, who are newer to the data science space, or who maybe just aren't compelled to compete or don't find that particularly motivating. I personally tend to be more motivated by finding out new things than by, you know, got to get that last hundredth of a digit of accuracy. That's just my personal learning style.
Here is an example competition. I know I've talked about CareerCon a lot, but I'm really excited, you guys. Here's a competition that we're going to be doing as part of CareerCon on robot navigation. So if you're interested in agent-based stuff, this might be one for you to take a look at.
We also have a growing number of Learn courses. These are not going to take you completely through the math in a very detailed way. Our goal with Learn is to get you started very quickly with practical examples, so these are short and condensed, and we want you to be able to go through them in, say, an afternoon and then apply what you've learned. This is one that's relatively new, on machine learning explainability, that Dan Becker recently released. If that's something you're interested in, sit down, spend a few hours, work through some hands-on examples, and know more. Maybe. Unless you're already researching machine learning explainability, in which case I don't know why you'd take a course in it.
OK, so that's what's on the platform. I also want to talk about some of the Google integrations that we've already added. Something
that I was doing as I was putting together this talk was really sitting down and thinking: how will these be helpful? What specific people can I see really benefiting from these integrations? Because we want to make sure that we're doing things that people can take with them, enjoy, and improve their work with.
Three things that we've added relatively recently. One is an integration with Data Studio. Who has used Data Studio? OK, a fair number of people. We've also added an integration with BigQuery, and also Google Sheets, which I'm actually very excited about, and I'll talk about why. I know people like to dunk on spreadsheet software, but it's a great tool, you guys.
All right, Data Studio. We have a little GIF here that's apparently starting in the middle for some reason. Interesting. You can click on any Kaggle dataset and, with just a couple of button clicks once you've connected your Data Studio account, launch a Data Studio dashboard based on that data. If you're not familiar with it, Data Studio is a way to visually analyze data. It's really nice for creating dashboards in particular.
One use case where I think this is going to be a huge time saver is for people who are working in educational settings, particularly if you want your students to really analyze a specific dataset. Just as an example, the Arizona Secretary of State releases their election datasets on Kaggle as they finish cleaning them. So suppose you're a government professor and you want your students to spend some time looking at election data. I don't know how many of you have done mapping or GIS stuff in Python or R. Yeah? Is it fast and easy, something you can usually do in 10 minutes? Maybe
some of you can, but me, I'm always like: ah, got to pick a projection. Oh no, there's no FIPS code, I've got to go get the FIPS code from somewhere. Or, oh, I've got to hook up a mapping API because I don't have OpenStreetMap on whatever I'm doing. So I think this could be a really fast way to work, particularly for mapping, especially if you want your students to be focusing on the information in the map and not on the process of making the map.
Also, I mentioned dashboards. I love Jupyter. There are some things that Jupyter is really great for; for dashboarding, there are some things where it's a little bit less good. Actually, if there are front-end people in here who would like to contribute to Jupyter, there's an HTML widgets bug that I don't think the core team is going to get to, because they're focusing on JupyterLab. It's been around for a while, and every so often different plots will render as one pixel by one pixel. Has anyone run into this? Anyway, when you're working with interactive plotting libraries, that can happen. If you run into that, again, it's just in the Jupyter core, and there's nothing I personally can do about it, because I'm not a front-end person. If you are, consider making open source contributions on this particular bug that bothers me. But using Data Studio instead is a nice way to work around that, and it's also nice for your teammates who are maybe not so comfortable working with code. I know that a lot of us work in teams with people with a variety of backgrounds and goals, so this is a nice way to reach out and work with them.
BigQuery. Who uses BigQuery? OK. How about people who use any relational database, or SQL in general? Yeah, OK, pretty much everybody. So right now you can access public
BigQuery datasets through Kaggle kernels, and we've got some very small code examples. I don't know why I thought these screens would be bigger; I think I thought the room would be smaller, so everyone would be a little bit closer. You can create a BigQuery client and run queries (wow, I should have had a second coffee) directly from Kaggle notebooks, and then take that data and work with it locally, or save it out as another dataset. If you save any files from a Jupyter notebook, then when you commit that notebook it'll run top to bottom, your files will be saved in the output, and you can create a new dataset from those. So if you're running a big query, or maybe you have a teammate who's not as familiar with SQL but wants to use some of the BigQuery public data, you can run the queries, create a dataset from them, and then point them to the dataset, which is really nice.
And if you are perhaps the person who is like, I don't like SQL, or maybe it's been a minute since you used it, we also have a SQL course. Again, it's very directed: we're not going to get into the parse trees of SQL or anything like that. It's just the basics of query writing, specifically talking a little bit about BigQuery and how to use it on Kaggle. So if you're a little trepidatious about getting started, we can help you out with that and walk you through it.
Just for those of you who didn't know, we have a Kaggle developer survey where we reach out to the machine learning community at large and get information back.
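The BigQuery workflow described above (create a client, run a query, save the result so it can become a new dataset) might look roughly like the sketch below. It is an illustration under assumptions, not the code shown on the slides: `run_query_to_csv`, `EXAMPLE_SQL`, and the output file name are hypothetical names, and the table `bigquery-public-data.usa_names.usa_1910_current` is only one example of a public dataset. The function assumes nothing more than a client that behaves like `google.cloud.bigquery.Client`.

```python
import csv


def run_query_to_csv(client, sql, out_path):
    """Run a SQL query and write the result rows to a CSV file.

    `client` is assumed to behave like google.cloud.bigquery.Client:
    client.query(sql).result() returns an iterable of rows, and each
    row exposes .items() yielding (column, value) pairs.
    Returns the number of rows written (0 means no file was created).
    """
    rows = [dict(row.items()) for row in client.query(sql).result()]
    if not rows:
        return 0
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Example query against a public table (the table choice is an assumption).
EXAMPLE_SQL = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY total DESC
LIMIT 10
"""

# In a notebook with the google-cloud-bigquery package and credentials set up:
#   from google.cloud import bigquery
#   run_query_to_csv(bigquery.Client(), EXAMPLE_SQL, "top_names.csv")
```

Writing the result to a file in the notebook's working directory is what lets the committed notebook's output be turned into a dataset for a teammate who would rather not write SQL.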
And SQL is consistently the third most popular language, after Python and R. So if you are not a SQL user and you're working in the machine learning space, that's good to know.
Finally, Google Sheets. I really like spreadsheet software; I think I mentioned it before. I think it's a really nice tool set that makes a lot of things very accessible to people, because you don't have the cognitive overhead of trying to remember what your data structures look like: it's there visually. I know people dunk on spreadsheet software, but I'm a big fan. So you can open datasets in a Google Sheet. I don't know if you guys can see, but it's under that little hamburger-style menu in the upper right-hand corner. You can open it as a Google Sheet; it takes just a second, and then it'll pop open.
There are two places where I think this might be a surprising use. The first is project management. I mentioned that you can create a dataset from the output files you generate in any kernel. So let's say you're working with some competition teammates and you're deciding what your hyperparameter search space is going to be. You generate a file with all the parameters, you create a dataset from it, you launch it as a sheet, and then you can use that as your base of operations. You can color-code who's doing what, and you can ping people, so you can say, this action item is for you, and they'll get an email, which is really nice. So it can be a nice project management tool, because I know that's something a lot of people struggle with, and I see a lot of discussion about it on the forums. If you don't already have one and you're working with other people, Sheets might be a nice integration.
And a really hacky one. I'm sorry, guys, this is extremely hacky, but I think it would be useful.
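Going back to the project-management idea for a moment: the file "with all the parameters" could be generated with something like the sketch below. Everything in it is an assumption made for illustration, including the function name `write_param_grid`, the parameter names in `EXAMPLE_SPACE`, and the `owner` and `status` columns, which are just one way to leave room for teammates to claim runs once the file is opened as a sheet.

```python
import csv
import itertools


def write_param_grid(param_space, out_path):
    """Write every combination in `param_space` as one CSV row.

    `param_space` maps a parameter name to the list of values to try.
    Empty `owner` and `status` columns are added so the sheet can be
    used to track who is running what. Returns the number of rows.
    """
    names = list(param_space)
    combos = itertools.product(*(param_space[name] for name in names))
    count = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["run_id", *names, "owner", "status"])
        for count, combo in enumerate(combos, start=1):
            writer.writerow([count, *combo, "", "todo"])
    return count


# Hypothetical search space; swap in whatever your model actually uses.
EXAMPLE_SPACE = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 500],
}

# write_param_grid(EXAMPLE_SPACE, "param_grid.csv")
```

Saved from a kernel, a file like this becomes an output file that can be turned into a dataset and then opened as a Google Sheet.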
One of the ways that people do text data augmentation: you have a small dataset, so how can you create new data items? For images, you can flip and rotate them. But what does it mean to flip a sentence? It sort of stops being language. One way that people do this is they'll translate from English into an unrelated language, like Japanese or Russian, and then back into English, because you'll get slightly different sentences that are kind of semantically equivalent. And I don't know if you guys know this, but there's actually a Google Translate option in Sheets now. So one way to do this would be to create a dataset of the text data you want, open it in Sheets, translate it using Sheets, and then create a new dataset from your translated sentences for your data augmentation. I did warn you it was hacky, but it's potentially very useful if you need to do text augmentation.
So those are some of the integrations that we have worked on so far. More to come (TM). Veil of secrecy, but there are lots of exciting things that we're working on that I think you'll really like and find useful.
We really want to make Kaggle a great place to do data science. I haven't checked the Dory, but hopefully everyone's asking questions, so I encourage you during the Q&A to let us know your feedback. If there are specific things you want to see, we would love to hear that. We really want to make Kaggle a great tool. I do almost all of my data science work in Kaggle, and personally I would like it to be better, for selfish reasons. I think it's already very good, but I'd love to hear what else we can do to help you guys.
All right. Also, there is feedback in the app. So I'm going to wrap it up, and we can take questions.
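For anyone who wants to try the translate-and-translate-back trick mentioned earlier, here is a rough sketch of the idea. In Sheets itself, the round trip can be done with the built-in formula `=GOOGLETRANSLATE(A2, "en", "ja")` in one column and `=GOOGLETRANSLATE(B2, "ja", "en")` in the next. The Python below shows the same logic in code; `back_translate` is a hypothetical helper, and the `translate` argument is a stand-in for whatever translation service you have access to, not a real API.

```python
def back_translate(sentences, translate, pivot="ja", source="en"):
    """Return new sentences produced by a round trip through `pivot`.

    `translate(text, src, dst)` is a caller-supplied function
    (hypothetical here). Round trips that reproduce an existing
    sentence are dropped, since they add nothing to the dataset.
    """
    seen = set(sentences)
    augmented = []
    for sentence in sentences:
        pivot_text = translate(sentence, source, pivot)
        round_trip = translate(pivot_text, pivot, source)
        if round_trip not in seen:
            augmented.append(round_trip)
            seen.add(round_trip)
    return augmented
```

Filtering out unchanged sentences matters in practice, because short or simple sentences often survive the round trip word for word and would otherwise just duplicate the training data.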
2019-04-26 13:47