Tackling High-Value Business Problems Using AutoML on Structured Data (Cloud Next '19)
All right, my name is Tin-Yun and I'm a product manager on the Cloud AI team, and today I'm here to talk about how you can tackle some of your high-value business problems using AutoML on one of the most common types of data in the enterprise: structured data. But before we dig in, I wanted to make sure everyone knows what I mean by machine learning on structured data. These use cases typically start with a table of input, kind of like this, where each row is an independent training example. Usually one of the columns is the target column and the remaining columns are input features, and the goal is to create a machine learning model that accurately predicts the value of the target column using the available input feature values for that row. For example, you might have a table of historic offers from an online marketplace, and you want to create a model that takes the available data on an offer and predicts the price at which that offer will sell, say so you can provide pricing guidance to your sellers. One common misconception I want to point out: people sometimes think you can only have simple values like numbers and classes inside a table. Actually, especially with modern data warehouse systems like BigQuery, you can put a rich set of things in there: timestamps, long pieces of text, lists of features, even repeated nested fields, and that's only getting started. According to the McKinsey Global Institute, data in this basic form, structured data, is likely to drive most of AI's impact, with time series a close second. This really comes as no surprise, because virtually every industry has mission-critical use cases that can be boiled down into this tabular form. In retail: predicting the likelihood of stockouts, or predicting price elasticity, so you can optimize your product inventory. In finance: predicting the risk of large claims or defaults, or predicting the likelihood of fraud in a given transaction, in order to manage your risk. In marketing: predicting lifetime value, like how much a customer will spend on your site in the next three weeks, or predicting whether they'll churn, so you can better understand your customers. The list goes on.
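To make that setup concrete, here is a minimal sketch of the marketplace-pricing table in Python, assuming pandas; the column names and values are hypothetical illustrations, not the speaker's actual data.

```python
import pandas as pd

# A tiny table of historic marketplace offers: each row is one
# independent training example.
offers = pd.DataFrame({
    "title":      ["Vintage camera", "Running shoes", "Espresso maker"],
    "category":   ["Electronics/Cameras", "Sporting Goods/Shoes", "Home/Kitchen"],
    "condition":  ["used", "new", "used"],
    "listed_at":  pd.to_datetime(["2019-01-05", "2019-02-11", "2019-03-02"]),
    "sold_price": [120.0, 65.0, 40.0],   # the target column
})

# The target column is what the model learns to predict; every
# remaining column is an input feature for that row.
y = offers["sold_price"]
X = offers.drop(columns=["sold_price"])
```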
These are use cases so core to their respective industries that even small improvements in model quality can have significant business implications. But the challenge is that, especially as the number and complexity of your input columns increase, you hit a combinatorial explosion in the things you need to worry about. Starting from the left and going right: in data preparation, for every one of your feature columns you need to think about missing values, outliers, and so forth. In feature engineering, for each of those columns you need to choose the right preprocessing to prepare it for whatever model you pick in the architecture-selection step, and there can be multiple options per feature column. In architecture selection itself, there are dozens of models you could choose from, with more coming out of the research community basically every month. For each model you select, you then have to choose the right hyperparameters, and there could be a dozen values to set. Then you have to think about tuning, and about ensembling if you're trying to create an especially good model, as well as model evaluation. And by the way, if you get any one of these steps wrong, you frequently have to start over. This iteration can go on for tens of cycles, especially on a hard dataset; I think the data scientists in the room have all experienced that. The entire process can take months, potentially dooming machine learning projects altogether as executive sponsors lose interest. So to help you overcome these challenges, we decided to build AutoML Tables: a tool for enabling your entire team, whether you're a data scientist, an analyst, or a developer, to automatically build and deploy state-of-the-art machine learning models on structured data. And I'm excited to announce that AutoML Tables is entering public beta as of this morning. Thank you. The way the product works is that we provide a graphical, codeless interface for guiding users through the entire end-to-end machine learning lifecycle, with significant automation as well as guardrails built in at each step, which I'll show you shortly in a demo. We start by helping you ingest your data and easily define your data schema and target; analyze your input features in a feature statistics dashboard; automatically train your model, including automated feature engineering, model selection, and hyperparameter tuning; evaluate your model's behavior before deploying it to production; and then deploy in a single click (the code sketch below shows the same lifecycle driven from Python). Through this we can help ensure that what used to take months of time takes only weeks or even days. Digging deeper into some of these steps: on the data ingestion side, we seek to handle data as found in the wild, so we provide automated feature engineering for essentially all the major data primitives you can find in BigQuery: numbers, timestamps, classes, lists, strings, and nested fields. We do know there are other data primitives out there we could cover, but this is our starting point. We also worked hard to ensure that we're resilient to, and provide guardrails for, imbalanced data, missing values, highly correlated features, high-cardinality features (say, every row having a different ID), and outliers.
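To ground that workflow, here is a minimal sketch of driving it from Python rather than the UI, assuming the google-cloud-automl TablesClient wrapper that accompanied the beta; the project, bucket, and column names are illustrative, and exact method names may differ by release.

```python
from google.cloud import automl_v1beta1 as automl

client = automl.TablesClient(project="my-project", region="us-central1")

# 1) Ingest: create a dataset and import a table from Cloud Storage.
dataset = client.create_dataset(dataset_display_name="bank_marketing")
client.import_data(
    dataset=dataset,
    gcs_input_uris="gs://my-bucket/bank_marketing.csv",
).result()  # block until the import finishes

# 2) Schema and target: type inference is automatic; name the target column.
client.set_target_column(dataset=dataset, column_spec_display_name="deposit")

# 3) Train: the budget is expressed in milli node hours (1000 = 1 hour).
model = client.create_model(
    model_display_name="bank_deposit_model",
    dataset=dataset,
    train_budget_milli_node_hours=1000,
).result()

# 4) Deploy for low-latency online prediction.
client.deploy_model(model=model).result()
```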
After that, we automatically search through Google's entire model zoo to find the best model for you, including linear and logistic regression for smaller, simpler datasets, as well as things like deep neural nets, ensembles, and architecture-search methods if you have a larger, more complicated dataset. One of the benefits of being at Google is that we sit next to some of the best research and consumer product teams in the world, so we can watch whatever great research results come out and cherry-pick the best of them into AutoML Tables for you, sometimes even before the results are published in research papers. Hot off the press, an example of this is our collaboration with the Google Brain team. They delivered what they call neural and tree architecture search: essentially, they took their architecture search capability, similar to the one they use for image classification and translation problems, added tree-based architectures to the existing neural-net architecture search space, and then added automated feature engineering so that it can work for a wide variety of structured data.
Note that this isn't published in a research paper yet, so I can't give more details, but expect more to be announced soon. And based on the benchmarks we've done, the results speak for themselves. There are a large number of vendors in this space, and we chose to benchmark against a subset of them with similar functionality. We benchmarked on Kaggle competitions, which I love as a benchmark because they involve real data from real companies that are putting tens to hundreds of thousands of dollars of prize money on the line to find a good solution, and they're willing to wait months to get a good answer, which is typically how long these competitions take, while thousands of data scientists across the world compete in them. If you rank well in a Kaggle competition, you're pretty much guaranteed to be doing state-of-the-art work on a problem that matters for the world. That's why we love this class of benchmarks. Here on the x-axis are the different Kaggle challenges we benchmarked on, and on the y-axis is the percentile ranking on the final Kaggle leaderboard if we had actually participated; these challenges had already ended by the time we participated, so we were benchmarking against the final leaderboard after the fact. What you'll see is that most of the time AutoML Tables gets into the top 25%, which tends to be better than the existing vendors we tested. But caveats apply, of course: on some of these datasets we were in the middle of the pack, and we'll be the first to admit we have a lot of additional work to do; we're constantly tuning and improving the system. But when it comes down to it, in general we do pretty well. So those are the benchmark results, and I wanted to dig into one of these benchmarks to give you a better sense of what it means to get into the top 25%. Here's an example: the Mercari Price Suggestion Challenge. Mercari is Japan's biggest community-powered shopping app and marketplace, and they created this challenge for predicting the price of a product offered on their marketplace, so that they could give pricing suggestions to their sellers. So this is a real version of the example problem I brought up at the beginning of this talk. Some logistics: this challenge ran for about three months, there was a $100,000 prize, 2,000 data scientists competed, and the winning data scientist made about 99 entries in order to win. So it was highly contested; it's not just an easy pass-through challenge. And the data looked kind of like this: about 1.5 million rows of offer examples with rich input features. You've got the name, there are categorical values, lists of categories, item descriptions. And some of this is genuinely dirty data: you've got "no description yet" (what is the model supposed to do with that?), and you've got things like redacted values over here.
There were missing values as well; this is a good, real-life dataset, and the goal is to predict the price in this column. Now, here you see a curve where we laid out the performance of each participant in terms of the final error achieved. The further to the right, the better that participant did on the final leaderboard; and the higher on the y-axis, the higher the error on the final test set. So as expected, there is a line that slopes down toward the bottom right, but importantly it isn't a uniform curve. What you'll see is that at the beginning there's a steep decline, where better feature engineering, better model selection, and all that really matter; there's still more signal to be squeezed out of the data. But then there's a long plateau, where most of the competitors have basically done whatever they can to squeeze the signal out of the data already. So getting into the top 25% actually gets you onto that plateau. And here is how AutoML Tables does after different numbers of hours of training. The way Tables works is you select the number of hours, and here we have Tables after one hour of training, twelve hours of training, and 24 hours of training. Caveats apply: the architecture search and the hyperparameter tuning process are randomized, so if you run this yourself it might come out a little better or a little worse. And we did do some limited data cleaning. For example, the categories were originally separated by slashes, so our system treated each one as one big token; in order to treat them as individual words, you have to break up the slashes by replacing them with whitespace (see the snippet after this section). But this is very basic stuff. And of course this is just the Mercari challenge; other challenges might have different results. But still: for a dataset of a million-plus rows with significant complexity, one hour of training with AutoML Tables gets you here. One hour of training. And by the way, as I mentioned earlier, the competition prize for this was a hundred thousand dollars; in comparison, one hour of training on AutoML Tables is going to be 19 dollars. And one more thing: as is well known for Kaggle competitions, the winning model typically takes months to productionize and deploy, but with AutoML Tables it's one more click to deploy the model into production. So how does that sound? Is that a good deal? This brings me to my final point, which is that by using AutoML Tables you'll save money. Not only does it increase your team's efficiency, but there is also no large annual licensing fee; you basically pay for what you use. And what you use is calculated as a small margin over the cost of the underlying compute and memory infrastructure. So for example, when I said earlier that training costs $19 per hour, that basically represents the cost of 92 four-CPU machines on Google Compute Engine; that's essentially what we're using under the hood. Prediction and model deployment are even cheaper than that.
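As promised above, that slash-to-whitespace cleanup is a one-liner; here is a minimal sketch assuming pandas, with an illustrative Mercari-style `category_name` column.

```python
import pandas as pd

df = pd.DataFrame({"category_name": ["Men/Tops/T-shirts", "Home/Kitchen/Coffee"]})

# "Men/Tops/T-shirts" would otherwise be treated as one opaque token;
# replacing slashes with spaces lets a text pipeline see three words.
df["category_name"] = df["category_name"].str.replace("/", " ", regex=False)
print(df["category_name"].tolist())  # ['Men Tops T-shirts', 'Home Kitchen Coffee']
```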
Regarding reference customers: we're thankful that a significant group of customers agreed to join our alpha pilot, with a number of them agreeing to be public references, listed here. We're proud that we've been able to prove our relevance across geographies, industries, and company sizes; you can see a wide variety of company types and profiles here, with a wide range of data science sophistication. You'll hear specific use cases from Fox Sports as well as GNP Seguros shortly. But before we get too carried away, I'll be the first to admit that we are not perfect for everyone. AutoML Tables is usually the best fit if you meet the following criteria. First of all, you have to know how to create good training data, and this is probably the most important point: somebody in the room needs to know how to take a business problem and translate it into an ML setup, a classification or regression problem of some sort, binary or multi-class. That person needs to be able to assemble training data, especially defining the label column, the target column, so that it represents the way you want the model to perform in production. And somebody also needs to understand how to make sure the input features have the same distribution as what will be available at serving time, so that you can prevent training/serving skew. A trivial example: if the training data were all in English but at serving time you fed Japanese to the model, obviously it would fail.
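The skew check just described can be approximated with a simple distribution comparison; here is a minimal sketch assuming SciPy and two pandas DataFrames, `train` and `serving`, sharing a numeric column (all names and values here are illustrative).

```python
import pandas as pd
from scipy.stats import ks_2samp

train = pd.DataFrame({"price": [10.0, 12.5, 11.0, 9.8, 13.2]})
serving = pd.DataFrame({"price": [55.0, 60.1, 58.3, 57.7, 61.0]})

# A two-sample Kolmogorov-Smirnov test flags features whose serving
# distribution has drifted away from what the model was trained on.
stat, p_value = ks_2samp(train["price"], serving["price"])
if p_value < 0.01:
    print(f"Possible training/serving skew on 'price' (KS={stat:.2f})")
```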
Second, you need to be willing to wait at least an hour; our minimum unit of training is one hour. If you're looking to do minute-by-minute iteration on your model so you can move quickly, maybe because you're very early in your development phase, then you're probably best off using something else, like BigQuery ML or Cloud Machine Learning Engine. And finally, for now, and I want to specifically call out "for now": a larger, more complicated dataset is ideal. Right now, if your dataset has fewer than a hundred thousand rows, you won't be taking full advantage of the advanced algorithms we're using. We're actively working on this problem, so expect more announcements soon, but for now, the bigger your dataset the better. The flip side is that our current limit is set at a hundred million rows, just to be conservative. But if you have a dataset that goes beyond that: we're using Google's machine learning infrastructure under the hood, so we're built for much bigger than that, and we would love to talk to you, so please reach out to your Google account manager. And now I'll go over a demo. One sec, let me load this up on my computer. Ready? Can we do the demo? Great. So now I'll walk you through the workflow using AutoML Tables. You start by ingesting the table of data that represents your problem. It can come from BigQuery or from Google Cloud Storage; you just select your table, say you want this one, and then you ingest. We've already ingested it, to save everybody time. The next step, after you ingest, is a schema page like this, where you can fix a variable type if it's set to the wrong one; we do automatic schema inference, but you can update it if it needs to be updated. You can also change the columns in case they could be missing, so if at prediction time one of these columns might be absent, you can mark it as nullable. And then you select the target column; in this problem, the target is supposed to be "deposit".
This specific dataset is about binary classification: predicting, after a marketing outreach, whether somebody will make a deposit at this bank or not. So it's a classification problem and the target is "deposit". If you're an advanced user, there are additional parameters you can set. For example, you can set your own data split between the training, evaluation, and holdout sets. If certain rows are more important than others, you can set a weight column. And if the data has an ordering over time that matters, you can set a time column, which makes sure the latest data shows up in the holdout split instead of the training split. Next, after you've set your schema, we provide a dashboard of basic feature statistics so that you can detect common data issues. For example, we allow filtering by feature type; there are more options available, but this particular dataset just has numeric and categorical values, to keep it simple. You can see what percentage of values are missing, which might indicate upstream system problems. You can see how many distinct values there are; for categorical values especially, if the cardinality (the number of distinct values) is close to the number of rows, then that column probably adds more noise than signal and you should remove it. We also detect target leakage. For numeric feature types you can see the basic distribution, and if you drill into any particular feature column you can see things like the distribution of its values, the features most correlated with it, and so forth. We'll keep adding more feature-statistics displays as people ask for them.
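Those dashboard heuristics (percent missing, distinct-value counts versus row count) are easy to approximate on your own data before ingesting; here is a minimal sketch assuming pandas, with an illustrative DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "job":     ["admin", "technician", None, "admin"],
    "user_id": ["a1", "b2", "c3", "d4"],   # unique per row
    "age":     [34, 51, 29, 42],
})

for col in df.columns:
    missing = df[col].isna().mean()       # fraction of missing values
    distinct = df[col].nunique()
    # A categorical column whose cardinality approaches the row count
    # usually adds more noise than signal (e.g. a per-row ID).
    noisy = df[col].dtype == object and distinct >= 0.9 * len(df)
    print(f"{col}: {missing:.0%} missing, {distinct} distinct"
          + (" <- consider removing" if noisy else ""))
```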
Next, once you're comfortable with your data, you can train a model. You set your model name and how many hours you want to train for; as you saw in the Mercari example, one hour is usually a good place to start. You select which feature columns you want to use. We automatically set an optimization objective for you, but if you want to set an objective yourself, because you know one of them lines up better with your business goal, you can do that. And then you train a model. I've already trained a couple here, so let's dive into one of them. Next is the evaluation tab. We provide the overall performance of the model here at the top, as well as some useful metadata, like which features were used and what the model was set to optimize for. We also allow you to export the model's predictions on the test set to BigQuery if you want to do additional analysis, and we show evaluation metrics on individual label slices. For this dataset, the value 1 means the person did not make a deposit at the bank, and that's 90 percent of the data, so you'll be able to see the model's performance on that slice of the data. The value 2 means they did make a deposit, the minority class, and you'll see the model does slightly worse there; that's typically common for imbalanced datasets, and it's up to you to figure out whether that's sufficient or whether you want to try other things, like increasing the weight of the minority class. You can flip between the slices and see the difference. You can set the score threshold to see what the precision and recall are at different points on that curve. If you're interested in exactly how misclassification is happening in the model, you can look at the confusion matrix. And then we have feature importance, to give you a sense of which features were the most important for making those predictions. Finally, in the predict tab, we allow batch prediction and online prediction. With batch prediction, you can submit a table from BigQuery or from Google Cloud Storage and export your results there as well. For online prediction, like I said, it's one click to deploy your model; we deploy it globally to make sure you get low-latency serving. And then you can test your model here inside the dashboard. Let's hope the demo gods are kind to me and this is working. Great, so you can see the prediction result. For regression problems we would give you a prediction interval, a 95% confidence interval, and for classification we provide confidence scores for the different labels. That concludes my demo; can we move back to the slides?
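For completeness, calling a deployed Tables model from code looks roughly like this; a minimal sketch assuming the google-cloud-automl beta TablesClient wrapper, where the project, model, and feature names are illustrative and exact attribute names may differ by release.

```python
from google.cloud import automl_v1beta1 as automl

client = automl.TablesClient(project="my-project", region="us-central1")

# Online prediction against a deployed model: pass one row's feature
# values; for classification, the response carries a score per label.
response = client.predict(
    model_display_name="bank_deposit_model",
    inputs={"age": 42, "job": "technician", "balance": 1200},
)
for result in response.payload:
    print(result.tables.value.string_value, result.tables.score)
```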
Great. And now I'm excited to introduce Chris and Jack, to share how Fox Sports is creatively using AutoML Tables to deliver a brilliant audience experience. G'day, my name is Chris Pocock, and first of all I'd like to say thank you to Tin-Yun and the team at Google for inviting us over from the other side of the world to come and share our story about Monty. I'd also like to introduce Jack Smyth, waving over there; he'll be talking to you in a few minutes' time. I look after marketing for Fox Sports Australia, and it's kind of a privilege to get paid to talk about sport all day long; it's a pretty cool job. I'm actually originally from the UK, and I've lived in Australia for about five years, so I want to start off by giving a foreigner's context on what sport means to Australians. It's a sporting nation; they're the little guys punching above their weight. Sport is not just part of life; they live and breathe it. It's part of the psyche, part of the national identity; Australians identify themselves through sporting success. So sport is a really, really big deal in this country. To give you a little bit of context: the Greater Los Angeles area has 18 million people and, by my reckoning, about 10 professional sports teams. Sydney: four and a half million people, and we have 25 professional sports teams. So there's a lot of sport to go around. The other contextual thing is that Australians looked at international sport, they looked at football (soccer) from Europe and at American football, and went: nah, not for us, we're just going to invent our own sport. So they invented Australian rules football, probably one of the most popular games in the country, played professionally only in Australia, but it is a big, big sport for them. So big that in this little country of about 25 million people they pack out a 90,000-person stadium week in, week out; it's pretty impressive. But Australian rules football is not the only sport played here; Australia is actually a pretty divided nation when it comes to sport. Highlighting some of our key sports: rugby union at the top, Australian rules football, rugby league, and motorsport; those are the key broadcast sports we have on Fox Sports. But within Australia itself, certain sports are more popular than others depending on where you live or where you went to school. Rugby league, for example, is really big in Sydney, not so much in Melbourne; Australian rules football is really big in Melbourne, not so big in Sydney. So it's a divided nation. That takes us to our journey at Fox Sports, where we have all these key sports from January through to September; it's a great lineup, but there's a bit of a gap in the summer. So last year we signed a broadcast rights deal for cricket, and cricket is the summer sport; you'll notice it covers October to December, because we're upside down, that's our summer time. So that filled the gap. Cricket cost us a billion dollars for the broadcast rights, so there's a little bit of importance on making it work. And we had to look at cricket in the context of it having been on free-to-air television: Australians had been able to watch it for free since the dawn of time, and we're now asking them to pay for it. So we had to promise a cricket experience like never before, and this is where AutoML Tables came in. Now, I'm conscious of where I'm standing and what country I'm in, so I just want to talk a little bit about what cricket actually is.
It's a game that we sometimes play for five days and no one wins, the kind of thing that blows a few minds; in Australia we draw quite a lot. But it can be like a chess match; it's really, really tactical. What you'll see is the last man holding on against big Goliath fast bowlers, or the game changing on a dime with three quick wickets; it does explode into action. But realistically there are only about 18 seconds of wickets, the wicket being this thing here; the aim of the game is for this guy to bowl this guy out. Across five days, that really gives us 18 seconds of action; it's a long, long game. So what we tried to do with this experiment with AutoML Tables is warn fans when we think a wicket is going to happen, ahead of time, so they know they need to be in front of the screen: don't go to the bathroom, don't go out for a barbecue, because the action is coming up. Now, cricket is also a really great game for structured data. There are a lot of different variables we can measure to accurately predict what we think is going to happen: the type of ball bowled, spin or fast; how long the batsman has been at the crease, has he been there an hour or four hours, is he tired or not, has he run around a lot; is he the last man standing (generally the last man standing is a lesser batsman versus the first guy who goes out); what field positions the captain has set, where are they looking to catch the ball; and about 80 other data points. That made it an ideal experiment, which we tried out last summer. And it bloody worked, which was quite impressive. So look, I want to introduce Monty. Just a little context: we named him Monty because the first wicket he accurately predicted was a chap called Monty Panesar, so it seemed apt to name him after that. So I'm going to introduce you to a world first: the world's first automated machine learning commentator. I'll let Monty give it to you in his own words. [Monty's clip plays: "Some beauty. Absolutely brilliant."] OK, all right, thanks so much, Chris. My name's Jack; I'll wait for the applause. My name is Jack, and I'll walk you through how Mindshare created Monty in collaboration with Google. Now, the first step was to find a world-class data provider in Opta Sports: they track eighty-three unique variables with every single ball, and that's available within seconds of it leaving the bowler's hand. So we knew we had great training data and a live feed set up, and after that we could start experimenting with Tables. I'd say our first impressions of Tables were striking: the speed and simplicity of the platform blew us away. We could easily ingest a year's worth of training data, simply select wickets as the label to predict, and within hours we were seeing impressive results. To be totally honest, some of the team members even thought Monty must be cheating, because the early results were simply that good. However, those doubts quickly evaporated as we moved through the training process.
Tables allowed us to build a model that could keep pace with a live game, so we moved ahead with a classification model, so we could predict not only when a wicket would fall but how. The final output looks something like this; this is a real example from a recent game between England and the West Indies. You can see the endpoint returns an analysis of the latest ball; in this case the batsman is safe for the next five minutes, the prediction is "not out", but then you can see the confidence scores for each method of taking a wicket, and there's a rising danger there of him being caught.
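To illustrate the shape of that output, a response along the lines Jack describes might look like the following sketch; every field name and number here is invented for illustration, not Mindshare's actual schema.

```python
# Hypothetical shape of Monty's per-ball analysis: an overall
# out/not-out call plus a confidence score per method of dismissal.
prediction = {
    "ball_id": "ENG-WI-2019-34.4",
    "prediction": "not_out",          # safe for the next five minutes
    "confidence": {
        "caught": 0.41,               # the rising danger in the example
        "bowled": 0.07,
        "lbw": 0.05,
        "run_out": 0.02,
        "stumped": 0.01,
        "not_out": 0.44,
    },
}

# A consumer (app, billboard, assistant) keys off the top score.
top_call = max(prediction["confidence"], key=prediction["confidence"].get)
print(top_call, prediction["confidence"][top_call])
```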
So once we saw the accuracy during testing, we knew Monty was ready for the main stage, and I have to say, as Australians we don't really believe in soft launches. So we chose the biggest, most iconic match in the Australian sporting calendar to debut Monty: Boxing Day at the MCG, in front of millions of fans. [A promo video plays:] We come here every Boxing Day for Australia's biggest day of cricket, day one of the Melbourne Test match. More than 80,000 fans in the stands, millions watching and listening around the world, magically drawn to the history and the tradition. This is where reputations are made. This is where Dennis Lillee knocked over Viv Richards with the last ball of the day. You don't forget that. As a kid, you dream of doing something amazing at the G, and a few get lucky. [Video ends.] Right, so how did it go? That was seamless. So, how did he go? Monty absolutely smashed expectations in his first game. We were thrilled with the accuracy, and Tables really came through when it counted most, when we had most of the country counting on Monty's call. His confidence scores for each wicket were displayed live in ads just like this for millions of fans, stretching from push notifications in the Fox Cricket app all the way through to pre-roll, and we were thrilled with the accuracy, a figure that is astonishing for live sport. And I would say that even the moments he missed became highlights in their own right, because it meant a bowler had come out of nowhere to take a wicket without warning; they'd literally beaten the odds. Now, as our confidence in Monty grew, we essentially scaled him up to become the command center of our entire campaign. This is the final architecture, largely thanks to an amazing Googler back in Sydney named Drew Jarrett, and in the time remaining I'll give you a quick overview of how it worked. Down here, when a ball is passed through our system from Opta, App Engine receives it and passes it through for processing. We use Dataflow to look at that ball, plus an aggregate of the recent balls, to give us what we call a prediction window. The model then makes a prediction based on that window and the individual ball, and then App Engine facilitates requests from live digital billboards, the Fox Cricket app, Google Ads bidding scripts, Studio dynamic templates, and even the Google Assistant.
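To make the "prediction window" idea concrete, here is a minimal sketch of aggregating recent balls into features alongside the current ball, assuming pandas; the column names and window length are invented for illustration, not Opta's or Mindshare's actual schema.

```python
import pandas as pd

# Hypothetical per-ball feed: one row per delivery, newest last.
balls = pd.DataFrame({
    "speed_kph":    [138.2, 141.0, 135.5, 144.3, 139.8, 142.1],
    "batsman_runs": [0, 4, 0, 1, 0, 0],
    "dot_ball":     [1, 0, 1, 0, 1, 1],
})

WINDOW = 6  # aggregate roughly the last over's worth of deliveries

# The "prediction window": features of the current ball plus rolling
# aggregates of recent balls, fed to the model together.
window = balls.tail(WINDOW)
features = {
    "speed_kph": balls["speed_kph"].iloc[-1],
    "mean_speed_last6": window["speed_kph"].mean(),
    "dot_ball_rate_last6": window["dot_ball"].mean(),
    "runs_last6": window["batsman_runs"].sum(),
}
print(features)
```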
And I have to say, for me the Google Assistant experience was a particular highlight. This was combining the predictive power of AutoML Tables with the personalization capabilities of the Assistant; we could genuinely offer every fan their very own on-demand commentator. Through the Assistant, fans could ask for the latest call from Monty, they could understand how he had made that prediction, and of course they could get the latest team news. For me, this is more than marketing; this is an entirely new product experience made possible through Tables, and one we are very, very proud of. At the close of the campaign, the impact of Tables was clear: we saw a hundred and fifty percent improvement in marketing ROI, brand recall for Fox Sports doubled, and the Fox Cricket app delivered an increase of a hundred and forty percent compared to category competitors. With a scorecard like that, you can be certain you haven't heard the last of Monty. I'll hand you back to Chris, who'll talk you through where we're going next. Thanks, Jack. So those are pretty damn good results, and frankly it took me by surprise just how well it all worked. This is not the end of our journey. What we really want to do is take Monty forward, beyond what we've done this last summer: not only into next summer, but also into some of our winter sports. We're looking at doing a man-versus-machine segment, letting Monty take on our panel of experts. What do they think the results will be? Can they beat AutoML? I'm really excited to see if they can. We also want to integrate it properly on air, in our broadcast, allowing our commentators to warn fans when exciting moments are coming up: don't go to the bathroom, make sure you stick around, Monty thinks there's a wicket coming. Fantasy is also a fantastic way of applying Monty's learning model: each week he'll build his ultimate team. Can he beat all of our consumers and all of our experts, week in, week out, by building on structured data versus the gut feel of our experts? Let's see what happens. I'm really excited for the future; it's been an amazing product and a great journey. So thank you to everyone, and I appreciate you listening to our story. Now, my last role of the day is to introduce Enrique and Carlos from GNP Seguros; hopefully I pronounced that properly. Welcome, guys, and thank you.
Thank you very much. Well, hello. Thank you to Tin-Yun and to Google for inviting us to share our experience with AutoML at this conference, and congratulations to Fox for such an interesting case. GNP is one of the largest insurance companies in Mexico. The company was founded a hundred and sixteen years ago, and we have yearly sales of around three billion dollars. Like any large, well-established company, GNP is undergoing a profound transformation. I was going to say a digital transformation, but we haven't quite figured out the precise definition of "digital transformation" because everyone's talking about it, so I'll just say transformation. Anyway, what we want to do is modernize our information systems and our operations, and we're leveraging heavily on the cloud to achieve that. One of the strategic initiatives we're executing under this transformation is to assemble a single corporate data lake: a single repository where all the information of the company will reside. Besides being an important enabler of efficiency, because we don't have to figure out where to go looking for information (everything is, or will be, located in the central data lake), we regard the data lake as an important source of competitive differentiation and of intelligence. To achieve that, we need to squeeze knowledge and value from all the data stored in the data lake, and we have recently started to use machine learning to actually uncover and squeeze out that value. The problem with machine learning is that the availability of well-trained data scientists is not that great. Carlos is the head of machine learning at GNP, but he's a rare breed of person; you don't find that many in the market. So when Google described to us the concept of AutoML Tables and offered us the opportunity to test it, we thought it was absolutely great, because we want to democratize the use of machine learning in the company and lower the complexity of generating new models. It looked very promising, and we embarked on testing and trying this great new tool. What we did, basically, is play with the tool on three different use cases: one in the auto insurance line of business, one in the health care line of business, and a third that is a general utility for our underwriters for collective insurance, which we'll describe a little later. The first example is really simple: given the characteristics of the insured car and of the owner, try to predict the probability of that car having an accident. To provide the technical highlights of this exercise, I'll let Carlos share it with you.
The Highlights, of this you, know the technical highlights, of this exercise I'll allow catalyst to to share it with you. Again. Hello to everyone. I, will. Only explain, how our using our email tables to solve insurance, problems at a GNP. As. Erica said the first problem is car claim risk this. Problem consists. In determining and predicting. The probability, of a car accident, using. Drivers, characteristics. Such as age gender claim, history as well as vehicle characteristics, such. As type model, and intensity, usage. When. One thing that is really amazing about our models, is that you don't have to worry about future, engineering, or hyper parameter tuning development, issues. The. Only, thing that you need to do is they, want the. Made in. 10 unit and the demo and I, want to highlight that. This had a very, good quality. Confirmed. By the F score in the green square and, I. Relate. The the, next the, to the next problem. Enrique, introduced. So. The, second. Use case that we used to try out on mail tables, is we had previously already, developed, a machine learning model to, try to detect, fraudulent, claims, in, health care and. What. We did is you know we created a new model using our 2 ml tables and we compared the results to, the results are already provided, by the existing, model that we have and given. The characteristics, of you, know of the patient, which is our customer, and the. Sales agent who saw the policy, to this customer, and the specific, disease that. This person has and the hospital, and the doctors involved we, try to to. Determine if, that claim is fraudulent or not just, to give you an idea you know our yearly expense on healthcare, claims he's around six hundred and eighty million dollars so. Any small percentage. That we can achieve. In, improving, the, you know detecting, fraudulent, claims it translates, directly into a lot of additional revenue to the company so. I liked Carlos, you know provide you with the highlights of this exercise. Okay. Thanks Erika again. We use like, claim kay the, clients claim, history as well as medical, hospitals, information, and. One. Thing that is really neat about the. Outer Mel's tables. Is that you have all the performance, revelations, or the. Performance results that you need at a click distance you don't have to make additional, work in you know in order to compute evaluation. Matrix and I. Want, to highlight that this model had like, a very. High quality compared. By the f1 score of. 0.94. And we. We. Confirmed this quality on an independent holdout, set and and, I. Want to highlight that it. Had led twenty to thirty percent, improvement respect. To our existing. Ml. Solutions. So. Bottom line we were really impressed just by using auto ml we got. Twenty percent between. Twenty and thirty percent increase in in in in in, effectiveness, of the, Tecton claim so that was really promising, and I'm. Very encouraging, to continue using this this tool and the. Last example was a practical, example you, know we, not only ensure individuals we, also ensure, collectives. Of people that is for example all the employees, of a company so let's say that we're going to ensure. All. The you. Know all the employees of a specific, company so what usually, the sales channel, does is once that they do the sale they, provide either a CSB, or a spreadsheet, file, to. Our underwriting, department, so they can do the quote and the underwriting, of the, policy, and one. 
One of the fields required for each person in the collective is the person's gender, but sometimes the sales channel omits that information. When that happens, what the underwriting department used to do was break the file up into many small files and send them to a lot of different people, and each person would manually classify, based on the person's name, whether that person is male or female. One of the last times that happened, the file was 10,000 rows long, so it was a lot of work. So one of the people from the underwriting department approached our data science team and said: well, you're supposed to be smart, and there should be a better way of doing this. It was a simple example: we basically used our master database as training data for a model that learns the gender of a person from their full name, the first name and last name. That's basically what we did. It's a small problem to solve, but it has a lot of practical use in the company, because this happens fairly often, so it ended up being a useful tool. I'll let Carlos describe this exercise.
Thanks again. Well, as you can see on the screen, we made a comparison. With a naive model using only the raw first-name column as a feature, we got a poor-quality result. Then we did a little character-level feature engineering, decomposing the first name into prefixes and suffixes (see the sketch after this section), and we gained a lot of performance, as can be seen on the screen. This shows the power you have at hand with AutoML Tables when the relevant variables are chosen. And again, I want to thank the Google Cloud team for letting us try it. So in this example, by tweaking the feature engineering a little and leveraging AutoML Tables, we were able to achieve very high model quality. This is just another small example of how we can use this tool. The results we saw with AutoML across the use cases we developed at GNP are really promising, so we're really excited about leveraging this tool to develop many more machine learning models much more quickly. We're determined to use machine learning to solve real business problems, or to give problems already solved with some sort of automation a much better solution, and tools like this are one way to get those results quicker. One of the things we want to achieve: currently, the underwriting of medical health care policies goes through a process where, using a set of predefined rules, around fifty-five percent of all new customers are underwritten fully automatically. We want to change that implementation, replacing the static rules with a machine learning model that would allow eighty percent of all underwriting instances to be fully automatic, with only twenty percent referred to a set of experts to determine the underwriting. That's one of our targets. On the other hand, since we already compared the results we got with AutoML against our existing machine learning model for detecting fraudulent health care claims, we want to improve the amount of fraudulent claims we're able to detect by ten percent this year, in order to contribute additional revenue to the company. So that's basically what we've done at GNP. I think this is a great tool. Back to you, Tin-Yun, thanks.
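As promised above, here is a minimal sketch of that character-level decomposition in plain Python; the prefix/suffix lengths and example names are illustrative, not GNP's actual pipeline.

```python
def name_features(first_name: str, n: int = 3) -> dict:
    """Decompose a first name into character-level prefixes and
    suffixes that a tabular model can use as categorical features."""
    name = first_name.strip().lower()
    feats = {"full_name": name}
    for k in range(1, n + 1):
        feats[f"prefix_{k}"] = name[:k]
        feats[f"suffix_{k}"] = name[-k:]
    return feats

# Spanish first names often signal gender in the suffix (e.g. -o / -a).
print(name_features("Carlos"))   # suffix_1 = 's', suffix_2 = 'os', ...
print(name_features("Mariana"))  # suffix_1 = 'a', suffix_2 = 'na', ...
```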