Code-Free probing of Machine Learning models - Pittsburgh ML Summit ‘19
Hey, I'm Tolga, and I'm a software engineer at a group called PAIR, which is People + AI Research, and that's part of Google Brain. I build tools for machine learning model understanding and fairness, and I do research in that area as well. Today I'm going to talk about the What-If Tool, something I'm really excited about. But before getting to that, I want to tell you a bit about PAIR.
PAIR is a multidisciplinary group within Google Brain that has researchers, scientists, and designers. What we do is fundamental research in machine learning, then we build platforms and tools based on this research, and finally we also do outreach and education. Our projects usually deal with model understanding, such as TCAV. We use visualizations for model understanding, so we build tools like Embedding Projector and Graph Visualizer. We also use ML to solve larger challenges, like science challenges. (Is it okay? Is it better now? Okay, good. Sorry.) One thing we did, for example, is visualizing earthquake aftershocks, trying to help earthquake scientists.
So what is the What-If Tool? It pretty much fits in this scheme: it allows you to probe machine learning models without coding at all. Let's get back to that. In general, when we look at a machine learning model, we can look at holistic measures such as accuracy, maybe ROC curves, then maybe the area under those curves. These give us some insight: maybe you're in the ballpark of a certain performance, and that's great. But there could be many models that perform the same on these holistic measures, and given two models, you don't know exactly what the difference is if I just tell you, hey, both of these have 90% accuracy. So we actually want to go a little deeper.
So, we are interested in human-centered machine learning, and that's why we want to ask deeper questions of these models, and it's really hard to do this with global metrics. Once we go deeper and have a better understanding of the model, maybe that can help us build fairer or better models, right? For the What-If Tool, the main goal is that, but there are many ways this can go. As I said, there's model comparison, or maybe you want to compare performance across different subgroups in your dataset, or maybe you want to compare individual data points. There's a lot of granularity, and most of the time, to do such things, like edit features or focus on a subgroup, you would need to do some degree of coding.
It's not super easy to do this in general, so we wanted this to be done very easily, with the minimum coding possible. This is what you see when you open the What-If Tool; this is its interface. The graphical user interface is designed so that you can iteratively and intuitively keep probing the model: get some results, change something, keep playing with the model itself on your dataset. We took a black-box approach for models, so we don't care about your particular implementation of the model; the only thing we care about is the inputs and outputs. This doesn't mean it's limited to just predictions: you can still get attributions or confidence intervals for your predictions, as long as they are defined in your output. One other advantage is that this runs completely in the browser, so you don't need to upload your data to another platform; you can run everything locally, because a lot of people have such constraints.
When we were designing this, we thought about the workflows of different kinds of people. There could be ML researchers who want to improve the performance of their model. Then there could be data scientists who are analyzing some datasets. But maybe there are also people such as product managers or business partners, who do not necessarily have to know every detail of the model, or maybe they don't have to code, but they can still go and analyze the model. They may want to look at certain aspects of the model before making a "hey, let's launch this" kind of decision. The third group is students, I would say, or people who are not ML experts but are trying to learn. This will allow them to easily enter this area, because there's less coding and a lot of features packed in already.
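The black-box idea described above (only inputs and outputs matter) can be sketched in plain Python. This is a minimal illustration, not the tool's own code: `model_a`, `model_b`, the feature names, and the two-column score format are all assumptions made up for the example. Any function with this shape could wrap a scikit-learn model, a PyTorch model, or a remote server call.

```python
import math
from typing import Callable, Dict, List

Example = Dict[str, float]
# Assumed contract: a list of examples in, one [P(class 0), P(class 1)] row per example out.
PredictFn = Callable[[List[Example]], List[List[float]]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def model_a(examples: List[Example]) -> List[List[float]]:
    """Hypothetical linear scorer; stands in for any real model."""
    rows = []
    for ex in examples:
        p = sigmoid(0.08 * (ex["age"] - 40) + 0.5 * (ex["education_years"] - 12))
        rows.append([1.0 - p, p])
    return rows


def model_b(examples: List[Example]) -> List[List[float]]:
    """Hypothetical second model with a completely different implementation."""
    rows = []
    for ex in examples:
        p = 0.9 if ex["hours_per_week"] >= 45 else 0.2
        rows.append([1.0 - p, p])
    return rows


def disagreements(fn_a: PredictFn, fn_b: PredictFn,
                  examples: List[Example], threshold: float = 0.5) -> List[int]:
    """Indices of examples where the two black boxes predict different classes."""
    scores_a = fn_a(examples)
    scores_b = fn_b(examples)
    return [i for i, (a, b) in enumerate(zip(scores_a, scores_b))
            if (a[1] >= threshold) != (b[1] >= threshold)]


examples = [
    {"age": 50, "education_years": 16, "hours_per_week": 50},
    {"age": 25, "education_years": 10, "hours_per_week": 50},
    {"age": 60, "education_years": 16, "hours_per_week": 30},
    {"age": 22, "education_years": 9, "hours_per_week": 20},
]
differing = disagreements(model_a, model_b, examples)
print(differing)
```

Because the comparison only calls the two functions, swapping either model for a different implementation needs no change to `disagreements`.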
One thing I like about this is that it's completely open source. It works in a variety of notebooks: Jupyter notebooks, JupyterHub, Google Colab, and Cloud notebooks. It's also part of TensorBoard, if you don't want to use notebooks. One advantage of notebooks is this: in TensorBoard you are limited to TensorFlow models, I guess, but in Python notebooks, the way we set this up, you can provide any Python-accessible function, and the tool will automatically start using it. That means you can have your scikit-learn model, you can put in a PyTorch model, you can have your custom implementation, or you can even have some server somewhere and send requests to that server whenever this function is called. So it's easy to use: for the notebook use case, just pip install witwidget.
Let's talk about some of the things I mentioned, like what I mean by "what if". I'll cover these at the end of the talk, in the demos, in a little more detail, but here are some examples.
The first thing is model comparison. For any feature you see in the tool, you can always have two models side by side. So if you have an ROC curve, you can always see two models side by side on it. This allows you to weigh the trade-offs when making decisions.
Another thing is that we have a fairness panel. There are some statistical definitions of fairness, such as demographic parity or equal opportunity. Once you have the model and prediction scores, you choose a group, or slice your dataset with respect to some feature, and then with one click you can say, hey, optimize my thresholds for equal opportunity, and the tool will automatically go and change all the thresholds so that that condition is satisfied. That lets you play with a lot of fairness metrics.
There's also this slicing feature. It shows the performance for each slice you have in your dataset in this pane: ROC curves, confusion matrices, pretty much everything. It also helps you identify these groups. Sometimes it's hard to say, hey, this group is actually important, so let me write that code and get the outputs. But in this case you can just keep slicing with respect to different subgroups, and you can slice by up to two features. You just see the results, then focus on the groups; maybe you'll find that your model is really, really underperforming for some part of the dataset.
Beyond visualizing model performance as PR curves, ROC curves, or confusion matrices, the tool also has Facets Overview and Facets Dive, which is the panel you see with all the colored dots. These help you set up custom visualizations. Let me show you what I mean by that. For example, in this case, on the right, I'm binning the data points. I have two classifiers, and it's pretty much a confusion matrix in scatter plot form, because I'm saying classifier one is correct, classifier one is wrong, classifier two is correct, classifier two is wrong. These are the bins, and the scatter plot shows the prediction scores coming from both classifiers, so the points off the diagonal are where they disagree.
We did some research on this for about 10 months, and we found that covering both sides is really very useful: the holistic view, like accuracy, and at the same time going down to a single data point. But there are some extra features on top of this. One is that you can go to any data point, pick one feature, and just edit it, and if you click on run inference, it will send it back to the model and you'll immediately see the difference in the score. So you can just play around with your features without doing much work. If you have a sentence, maybe there's a typo in the sentence, and you fix the typo and suddenly the score goes up, and you're saying, hey, this model is actually overfitting to grammar somehow.
The other side of this is that, rather than you editing features, we individually edit each feature; we call these partial dependence plots. They show how sensitive the model is to each feature, and you can then sort the features by interestingness, from the most sensitive to the least sensitive. I'll show these in the demos later on.
One other feature is counterfactuals. When you edit features, you are creating artificial data points, right? They may not exist, because you just put something there, and maybe it's not realistic. What counterfactuals does is find the closest data point in your dataset that is predicted differently. So when you click a data point, and let's say someone got a loan: find me the person who has the closest profile to this person but didn't get the loan, something like that. And the What-If Tool actually allows you to define what distance means. You provide that function; if you don't provide it, it will default to something, but you can customize it. You can say, hey, let's say in this loan example I want to help someone get this loan: what should that person change to be eligible for it? In the general sense, you can't just say, hey, go change your occupation (well, you can, but it's bad); that's a very hard task. Maybe there are some easier things that person can change in their profile or application that will lead to them getting the loan. That's possible with custom distance.
We have all the demos, with a lot of content, there actually, and if you just Google "What-If Tool" it will come up as well. But rather than continuing with slides, I think at this point it's more useful to go to the demos.
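The counterfactual search with a custom distance, as described above, can be sketched as follows. This is a toy illustration under stated assumptions: `loan_model`, the three features, and the weights in `effort_distance` are hypothetical, and the tool's own default distance may differ from the plain L1 distance used here.

```python
import math
from typing import Callable, Dict, List, Optional

Example = Dict[str, float]


def loan_model(examples: List[Example]) -> List[float]:
    """Hypothetical black box returning P(loan approved) for each applicant."""
    scores = []
    for ex in examples:
        margin = ex["income"] - ex["debt"] + 10 * ex["occupation"] - 30
        scores.append(1.0 / (1.0 + math.exp(-margin / 5)))
    return scores


def l1_distance(a: Example, b: Example) -> float:
    """Default distance assumed here: every feature counts equally."""
    return sum(abs(a[k] - b[k]) for k in a)


def effort_distance(a: Example, b: Example) -> float:
    """Custom distance: changing occupation is treated as very costly."""
    weights = {"income": 1.0, "debt": 1.0, "occupation": 100.0}
    return sum(weights[k] * abs(a[k] - b[k]) for k in a)


def nearest_counterfactual(predict_fn: Callable[[List[Example]], List[float]],
                           examples: List[Example], index: int,
                           distance_fn: Callable[[Example, Example], float] = l1_distance,
                           threshold: float = 0.5) -> Optional[int]:
    """Index of the closest real example whose predicted class differs."""
    scores = predict_fn(examples)
    selected = scores[index] >= threshold
    candidates = [i for i, s in enumerate(scores) if (s >= threshold) != selected]
    if not candidates:
        return None
    return min(candidates, key=lambda i: distance_fn(examples[index], examples[i]))


applicants = [
    {"income": 40, "debt": 20, "occupation": 0},  # denied; the selected point
    {"income": 42, "debt": 20, "occupation": 1},  # approved, different occupation
    {"income": 40, "debt": 8, "occupation": 0},   # approved, less debt
    {"income": 38, "debt": 22, "occupation": 0},  # denied
]
default_cf = nearest_counterfactual(loan_model, applicants, 0)
effort_cf = nearest_counterfactual(loan_model, applicants, 0, effort_distance)
print(default_cf, effort_cf)
```

With the default distance the nearest approved applicant differs mainly in occupation; with the custom distance the search instead returns the applicant who differs in debt, which is the easier thing to change.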
So the first demo I have is this model comparison on income prediction. Let me open this up. Interesting. I guess I just lost... right, okay, I just lost my... I'll just open it here. Okay.
So this is how the tool opens. This is an income classification task: given features like age, capital gain, capital loss, education level, and such, you want to predict whether this person is making more than 50K or less than 50K. We have two models loaded here: one is a linear model, one is a deep learning model. The points off the diagonal are pretty much where the models disagree; if the models were exactly the same, this would be just a line, because this is inference score one versus inference score two.
For example, if we go to partial dependence plots in this case and say, hey, show all the sensitive features, so sort features by interestingness, I find that capital gain is actually very influential: it changes the predictions a lot, for both of these models actually. You see this data point has a capital gain of about 3K. Model one, this orange line, is the linear model, and the blue one is the deep model, and at this point they differ a lot. They differ so much that the linear model predicts this person is making more than 50K and the other one predicts less than 50K. For someone who has more capital gains, you would expect that person to make more. But there is this interesting pattern that says, hey, if you are making zero, you're actually better off than making 3K a year in capital gain. So why is this happening? You can immediately go to Features, and if you look at capital gain you'll see that this is a very sparse feature. You can switch to log scale, and you'll see almost all the examples have zero capital gain. There are some examples that have a lot of capital gains, and these are classified correctly. But there are several examples, maybe two or three examples here, that have high capital gain but lower than 50K income. Because we had those several examples in the dataset in that sparse region, the model overfit to that region and said, you know what, if you have around 2K or 3K capital gains, I'm going to say you're making less than 50K, although that's probably not statistically true. If we collected more data, this would probably change. So this immediately tells you, hey, maybe there is something wrong with your dataset in this case.
Another thing we can do here is go back to the data points and say, hey, find me the closest counterfactual data point to this one. Okay. (Not sure if it's visible. Can you...? Okay, yeah.) So we found someone else who is almost the same age, about four years' difference. The capital gain is zero. Almost everything else is the same; they work a little less per week. And this person is classified as making more than 50K. So it's just this capital gain that is messing up your model.
I'll show you the Performance & Fairness pane in this case. What you see here, for all the data points, is the thresholds for model one and model two. You also have the ROC curves, precision-recall curves, and confusion matrices, so you can see an overview of what's going on. If you click here and say, hey, I want to slice by, let's say, sex in this case, you'll immediately get the performance on the male and female subsets, which differ for this particular model. Then you can say, hey, I actually want demographic parity between these two groups; can you optimize the thresholds for me? As you see, because we want demographic parity, the threshold for the male subset increased and the one for the female subset decreased significantly. So there is some sort of dataset imbalance here, something going on, and maybe you want to dig down into this.
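The threshold adjustment for demographic parity shown in the demo can be sketched in a few lines. This is a simplified stand-in, not the tool's actual optimizer: the scores are invented, the target positive rate of 0.4 is an arbitrary assumption, and the search simply picks, per group, the grid threshold whose positive rate is closest to that target.

```python
from typing import Dict, List, Optional


def positive_rate(scores: List[float], threshold: float) -> float:
    """Fraction of examples predicted positive at the given threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def parity_thresholds(scores_by_group: Dict[str, List[float]],
                      target_rate: float,
                      grid: Optional[List[float]] = None) -> Dict[str, float]:
    """Per-group thresholds whose positive rates are closest to target_rate."""
    if grid is None:
        grid = [i / 100 for i in range(101)]
    chosen = {}
    for group, scores in scores_by_group.items():
        # min() keeps the first (lowest) threshold among equally good ones.
        chosen[group] = min(grid, key=lambda t: abs(positive_rate(scores, t) - target_rate))
    return chosen


# Invented prediction scores for two slices of a dataset.
scores_by_group = {
    "male": [0.9, 0.8, 0.7, 0.6, 0.3],
    "female": [0.7, 0.45, 0.4, 0.3, 0.2],
}

# With one shared threshold of 0.5 the positive rates differ a lot.
shared = {g: positive_rate(s, 0.5) for g, s in scores_by_group.items()}

# Per-group thresholds that bring both groups to the assumed target rate.
thresholds = parity_thresholds(scores_by_group, target_rate=0.4)
adjusted = {g: positive_rate(s, thresholds[g]) for g, s in scores_by_group.items()}
print(shared, thresholds, adjusted)
```

In this toy data the threshold for the higher-scoring group moves up and the other moves down, the same direction of change seen in the demo.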
So this is for the census demo. I have a second demo, which is smile detection. The reason I want to show this is that it shows you can load image data; you can also load text data. In this case, the task is pretty much predicting whether someone is smiling or not. If I click on somebody, I see their face image, and then there are some annotation features. These are not used by the model in this case, because the model only looks at the image, but you can add any extra features for slicing or grouping purposes. What we did here is load the MobileNet that's trained on ImageNet, and we use the MobileNet embeddings to compute the distances, but our smile detector is a completely different model. So when I pick some point and say, hey, show me the closest counterfactual, it picks out an image that looks very similar but got classified differently. So it's in terms of features, and you can change your model. The rest of the features are similar for this demo.
Finally, I'll just show you a small text example. If you want to play with these demos, I'll be outside with laptops, so you can ask me more detailed questions later on as well. This is an example where, given a comment, the model tries to detect whether it is toxic or not. We train the model and load the dataset, everything, in a notebook, and then in the end it pretty much loads the What-If Tool, and you can do the same analysis with text. For this particular notebook, for similarity between two sentences we used Universal Sentence Encoder. So you see it's kind of flexible. It's hard to define what similarity means between two sentences, but you can keep changing the similarity, play around, and see what comes out, and whether you can discover something interesting.
Okay, so I got the five-minute warning. Is that for the questions, or is this for the whole talk? Okay, then I guess I can take questions now, because this will just launch the What-If Tool and that's it; that's all I want to show here. It runs in a notebook.
Yes? Mm-hmm. I see. So one example I was thinking of is pretty much that loan application case. In that case, even if you didn't look at partial dependence plots, you could find the closest real counterfactual, and you may find that one feature that shouldn't be that important was really important. For example, everything else being the same, if I switch the sex and suddenly it says, hey, you know what, you're not getting this loan, maybe that's really bad. So counterfactuals let you find the closest, most similar examples, and when you see that most similar example according to the model, you can gain some insight from it.
I see. That's true. I guess just by looking at a single data point's counterfactual, it's really hard to generalize to anything like the whole dataset, or to say the model is biased in some way, because maybe that's just one odd data point and there was an error or something. But I think it completes the package. When you have partial dependence plots, you automatically see which features the model is most sensitive to, so you can see some features that maybe the model shouldn't be sensitive to. Then you can play around with counterfactuals to complement that. I don't think counterfactuals by themselves are useful for making very broad decisions, like statistically significant decisions. But you can keep playing with that, right?
If you identify a certain group, for example, you can keep looking at the data points in that group, keep checking their counterfactuals, and try to discover a pattern. It's mostly complementing the other features, I think.
That's a very good question. I think we get this a lot, because this tool runs in the browser in the end. So if you want to load millions of examples, it won't be able to. For tabular data it can load tens of thousands, so twenty or thirty thousand is totally fine. It depends on the size of your features, of course; if it's a huge feature, your memory won't be enough to load it in Chrome, let's say. But the thing is, if you have a million data points, you can somehow slice or sample that dataset. Maybe you already know that some part of this dataset is problematic; maybe you have some underperforming part, and you focus on that. Based on accuracy you can check this section: hey, when this feature is that, okay, this doesn't work. Then you dig down into that. So this tool is mostly for more detailed analysis and for discovering patterns without writing code. Otherwise you would go and change this feature, change this a little bit, and keep plotting things. This lets you speed up that analysis, but right now it cannot handle millions of examples.
That's true. Yeah. Mm-hmm. So I guess the main part there is that right now we cannot handle seq2seq models; we can handle sequence-to-classification models. If you are talking about machine translation, that's on our radar; we are planning to make text models work more smoothly.
Customizable in the sense that you're thinking of taking parts of this tool and using them differently? Or what kind of customizations? You can load your dataset even if it's a text dataset, right? You can just point it at your dataset and it will start working. With text, if you tokenize sentences, each token will be a separate feature here, and we can handle varying-length features, so you would be able to go and edit individual words and then see how the prediction changes. Again, you can use the custom distance to define the similarity of two sentences and check out what happens, and the scatter plots, binning, all sorts of stuff. So everything will work with text, actually. The only limitation is that if you are working with seq2seq models, right now it won't work. If you want to load something that has a million classes in the output, like each word is its own class, things like confusion matrices will completely fail right now. But for the general task of taking a text and predicting something, like, hey, is this comment toxic or not, it will work right now.
Actually, I didn't show you in this demo: you can go here, and you see the text here, and you can edit this text, do whatever, run inference again, and it'll predict. So you can actually edit text here. These are terrible comments; this is taken from the Wikipedia toxic comments dataset. But yeah, you can make this "happy" and predict again, and you see that the prediction says it's more toxic now when I put "happy". So, interesting. Yeah.
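The edit-and-rerun loop on tokenized text described above can be sketched with a toy scorer. Everything here is assumed for illustration: `TOXIC_WEIGHTS` and `toxicity_score` are a hand-written stand-in for a trained toxicity model, and `sweep_token` is a hypothetical helper that tries several replacement words at one position, a small-scale analogue of a partial dependence sweep over a single feature.

```python
import math
from typing import Callable, Dict, List

# Hypothetical word weights; a real toxicity model would be learned, not hand-written.
TOXIC_WEIGHTS = {"stupid": 2.0, "terrible": 1.5}


def toxicity_score(tokens: List[str]) -> float:
    """Toy black box: P(toxic) for a tokenized comment."""
    z = sum(TOXIC_WEIGHTS.get(tok.lower(), 0.0) for tok in tokens) - 1.0
    return 1.0 / (1.0 + math.exp(-z))


def edit_token(tokens: List[str], position: int, new_token: str) -> List[str]:
    """Return a copy of the comment with one token (one feature) replaced."""
    edited = list(tokens)
    edited[position] = new_token
    return edited


def sweep_token(predict_fn: Callable[[List[str]], float], tokens: List[str],
                position: int, candidates: List[str]) -> Dict[str, float]:
    """Score the comment once per candidate word placed at the given position."""
    return {word: predict_fn(edit_token(tokens, position, word)) for word in candidates}


tokens = "this is a stupid comment".split()
before = toxicity_score(tokens)
after = toxicity_score(edit_token(tokens, 3, "happy"))
sweep = sweep_token(toxicity_score, tokens, 3, ["stupid", "terrible", "happy"])
print(round(before, 3), round(after, 3), sweep)
```

In this toy scorer the edit lowers the score. A trained model can respond in the opposite direction, as in the demo, and that kind of surprise is what this probing is meant to surface.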