# State of Unsupervised Learning

You. I don't. Think many of you need an introduction to challenging Bhattacharyya. Faculty. At Indian, instead of science and you. Know a very well-known name in the machine learning space in India and abroad said, without much further ado -. Thank. You, so. As. We said now all talks, in this session should begin with campaigning. For Vijay, so. Let me do my bit and. I'll just end with that so we'll do that. But. But. Having. Got that out of the way now I'd like to thank the organizers for inviting me. So. The. Organizers asked. Me I mean we Jane or Satish, asked me that. That. Can you give overview, of, things. Which you'd like to do like unsupervised, learning. Then, I, didn't ask them but they say 20 minutes. So. Still the art in unsupervised learning you can talk for hours of, course all scientists, have to talk for hours so that is the thing. So we teach topic slightly it's this state of unsupervised, learning so here I would like to confine, myself to broad comments you, don't have to agree to this if you disagree, please put up your hand we'll have a more meaningful conversation. So let me write, this talk like that so. Yeah, so, I am zero and. Today. We are going to discussed it was unsupervised learning now. See. There. Has been a, lot. Of talk about. AI. Being. A big disrupter. Right. Recently. We. Had a very, energy, a very. Exciting. Debate over, the internet when we heard someone the similar person that is like Elon Musk and Mark, Zuckerberg. Had. A spat about AI is good or bad but. He almost made this comment that I did presents a fundamental risk, to the distance of civilization. Now. Even. In the morning session we heard that governments, are worried. Right. So and I think this question requires. Some thought I mean I don't think alone is looking for publicity he already very well known and. Also Mark Zuckerberg is also knows though Ellen, says he doesn't know I but I think he so this guy also is very smart so. Having. Said that then, this wonderful. A moment and ask what is the eye. So. Now yeah as in the previous morning, session, there were some discussions and in a recorded slides there were some, definitions, of AI, to.

Be One definition of a is the. Ability to learn from experiences. This. Is one definition if we agree to, that see. Then this brings out that. You. Know so. You. Look at the talk which we just heard this amazing, exciting. Results. In ranking. And. Recommendation. Etc right but. The fundamental problem is the premise. Is that I have some labels. So. If you were to learn from experience who gives you the labels we don't often don't, have labels and that. Brings us to this thought that. Maybe. Unsupervised. Learning, is. The way to go if. You crack that then you mastered AI and. Now here is some good news and bad news if. You, believe in e-learning masks, catastrophic. The, theory of you know everything, is going to be under. AI and we'll all be servants or whatnot. What. We find this and so as learning is very hard did. It got 3040 years before he can master this so. Here is computer science way of resolving. This issue that. Yeah. Answer was learning today seems, to be intractable and in. This following, slides I'm going to express, some of these views and say. What, is untrackable about it and why is it intractable, like that, but. Having said that there. Is phenomenal. Progress is happening. So. So. We can only say this that. It is integral at the moment in, future we're not sure. So. Having. Said that so let us have this. The. Usual spiel of supervised learning so, basically, we. Can take the first talk the, talk which we heard just before as. An example and then. So. Go. Forward with that so we, are often given example. Let's say the diamonds and the Plus. Right, and so. If we have such labels and say that next, time you show me an example tell me was a diamond or a a-plus all of you who have done a basic, course in ml will, know okay so we can find some you. Know some separation between them maybe, two lines you can draw like this and one. Side of this line there are pluses, in other side of the diamonds right and so. This has been the basis of this very famous SVM, algorithm etc. Nothing. For a moment and, say, suppose. That don't give you the little buddies give you the data. Now. What can I learn. Now. Here is a question. Even. See the question is imposed at so many levels. Why. Just give you this data and said what do we want. Now. This is I think a, fundamental, problem where on service learning is grape, language so. One. Very. Like. We would not say a mathematical, principle, but a philosophy, behind this is see, maybe should find structure, indeed oh, now. That's a very ill posed question so, we'll see how far this goes in. The rest, of those things, so. This. Equation, to me is. One. Of the fundamental, reasons of success between of. Supervised. Learning so. Here, let. Us assume for a moment we are we are in the labeled world so where why are the labels and X's are the your. Data points right. So, now you have to be the function where, it takes in X and outputs. To you or Y, now. I want to say that if X and y have the same distribution one. Can say that see. This. You. Know what is the property that you don't dig don't, agree that the prediction would given out by F it. Doesn't match Y right. So, this is sometimes called a generalization, ability, they, will to generalize able to see unseen examples, now. So. This is sort, of given by so. All you give me some test examples, right so, the text examples, and the test examples, what, you do is you. You. Know you measure your. Test set error this, I can do right. Now there is this, bound. So. This bound says that, your generation error is your, test set error. Plus. A, function. G. Of H where. H is often so few linear classifiers, H is often called the VC dimension you. Forget for a moment but. What it says is that H is a measure of complexity, of the function class. So. Essential, idea is this then that. You. Know you try to fit if, H is low you, know so how is this is zero then, this would be cool and. H. G's, increasing, function of X right so often. So the most mischievous story is get. This low training, set our training, set error low. With. H being low, so. Try to fit, the simplest, possible function. A classifier. Which. Gives the best training set error you. See so that is the guiding principle, once. I have that then, then I say how will I control a each or less control you. Know the strengths at error this that will. Go I. Imagine. Think for a second I don't have why now where is this what what is this side I don't, have any this right, so, this is this problem, now so, even there's no formalism. Didn't think about. Right. Now. Having said that we'll see some examples now, unsupervised. Learning is also making great waves and I think that is the next wave in, machine learning so. Existing models are often intractable yet works very well in practice. Right. So. Now the. Need of the hour is however, as we, will see that. So. There are some recent attempts, which, which will try to make, this. Problems, tractable. But. The problem is they don't scale they. Don't work for large data and. As. We said that main premise is that they in the world there lots of data but they're not labeled so.

Any Answer is learning to be successful, should, should be able to work on large-scale data any. Formulas you I put now, here are some things things which a heuristic works, very well but. You, know you're not sure, you. Know is it giving result and often people who worked on service learning will tell you when. I went attuning is a huge, challenge of course also they're in deep learning so. So. The need for, algorithms. Which are not really scalable, but, that should also give statistical, and algorithmic guarantees, this is the need of the and, I, don't think we are there yet so we will see some examples. As. We go along so, this, so. So. You have chosen on service, learning has I said there is no fixed. Structure. Or. A paradigm, to think about so, many people think many, different ways so. One of the ways is that ha. Maybe, I'm given lots of features. Can. You bring, it down to a very few features, which, are often called dimension reduction, so. One of the dimension reduction is that ah, you. Know you, beta is there you know you should go ahead and look. At Rishon of maximum variance this. Often is known as PC a little, bit understood. Now. Nothing. For it so, there is a variation of the problem what is it okay if I do pca what I am finding is that it's. Giving a direction but, which is a mixture of all other features the. Price I pay is. I. Lose interpretability. Interpretability. Means I must able to tell you in terms of features huh, so this. Gene was active that gene was active and these three genes caused this so, that's interpretability, but, if i say this in this subspace it's like that does it doesn't often is not very interpretable. So. Variation of that is hardness, called a sparse PC now, that becomes np-hard, problem. Now. There is this another, very. Famous problem, and I think is one of success stories of unsupervised. Learning. You. Have gone to a, cocktail. Party all. Right I don't. Know why I asked you a cocktail party but. Here I guess you. Don't, listen well I guess because of the drinks or alcohol or whatever so. Imagine, there. Are. Three. Speakers, and. They are talking as each other and. There. Are three microphones. Right. In, the party, you. Can all have meaningful conversation, but we're all talking this one another. Right. In this but, in the speak in the indigent party here you can definitely make out who's speaking what let's. Do one thing but. Let us now make the recordings, available. And. So the recordings can automatically, find out who, said what. Apparently. It, becomes a huge messy problem and that's not the biological problem also so. Is a very formal cocktail party problem now, that has been solved. By what is known as independent. Component analysis, now, here it, seems PCL, looks for correlation, or, intimate computer analysis looks, for. Independent. Directions so where the search is for non-gaussian etcetera, etcetera we will not get into that but the main problem is this, is again or again. Another intractable problem, to find independent, computers.

So. A simple version of this problem you see is getting hard. So. A major success story in unsupervised, learning is. Probabilistic. Models, and, probabilistic. Models, often there's this problem of inference where you have to figure out that. What are the parameters, and. And. Other. Other things, these. Are often, become intractable. We. Said this intractable too many times. But. Let's start with a very simple version of the problem which all of us who. Don't ml, who. Have gone through ml course you, learn it in the you know in the first few lectures is. Can, you find groups indeed. How. Hard is that problem. So. For example the the picture which we draw in the in the first few slides and. I can stick the slides and say ah look. There, may be two groups like this right you. Know, but. You see it could have drawn another, group here right and, this. Is three group another group so. So, how many groups if, so. Who. Goes where right these kind of problems is also not, very well, understood to begin, with so. Yeah. So. Now what we will do is taking this problem we will look at three, specific instances. One. Is clustering which we saw on clustering, one of the very few most algorithms is k-means algorithm. How. Good is it does, it doesn't that doesn't do anything interesting, for, that matter. We'll. Talk about, non-negative. Matrix factorization, we'll. Talk about topic models very, quickly. So. So. Clustering has many uses I will not bore you with that but probably one of this one interesting uses where people, look at gene expression data and said ha, so, you, know you can these these forms the, red to green are different kind of clusters as given. Here. Is. To motivate, the. Examples, of clustering so from understanding, cancer biology, which was the previous slide to, here you know hard business that can I Sigma at the markets so clustering, has many many uses I, mean. So. Now let's come back to our problem, what you said so. Clustering, is trying to figure out. Structure. Indeed what kind of structures the fine groups okay so what is the group group is nothing but a partition that is you, know your, data was there now, can you partition them into you. Know some k number of sets right. And which example, goes to any just one of the sets and the question is can you find those partitions so, one. Way. To partition, them is that, ah maybe. Each partition, each set, in the partition has a particular, cluster, center maybe the most representative, guy and you, take a new to take, a data point and say which, is the most representative of them and you measure in terms of this distance and, you, get what you get, all. Right so, this is a very famous. Clustering. Setup. Sorry. Now. There is a very nice algorithm the algorithm is that you. In the first go when you have those representatives, you. Assign them using. This that, we assigned them the nearest, cluster. Center and then you update the cluster centers right using, this simple, formula this is sometimes when the k-means algorithm. 1957. We. All know it works very well also the, question, is very say it works very well it said what in works very well I took data I ran it it worked.

Can. You know what can you guarantee with this the. Problem is if you if for now what you do is when. You say it works well often. If we just see the data little bit you see it doesn't work well. So. Now this does not seem to be that simple, question to answer that when will it work and when it doesn't work so, often so, this has led to a very phenomenal, interesting, work so. So. People thought about this right if. You assume that each cluster. Each. Example is coming from a Gaussian distribution. With. Certain mean and the variance let's. Say but. If the means are well separated. So. Mu1 minus mu2 are the means mu L minus mu where is the means and if they're well separated right that's a lower bound in terms of the variances, you know then. We have something, then it then it can do but. Then then, it says if you believe in this argument then you're saying what are you saying you think ah the distribution must be questioned oh it. Doesn't work, and. That's what it's saying that's what our understanding is so. Then a very interesting set of conditions, came up by, Kumar and Kannan so, here where, they're saying is that see, you, can you, don't have to do this probabilistic thing, but. What. You can do is you can assume that if if each data point you. Know. Lies. To. The closest, mean along. A line, joining. The. So. Let's say you. Take a data, point X and you take two other means, master, centers mu L mu s if X is close to me well then, it would be closer, along the line joining this now. Whatever it is, you. See this. Particular. Assumption. Is. That probably, stick. It. Did not say in distributions, for that matter right. So this was a very good progress and they, could derive and they, could show it has more a more interesting part they. Show even then k-means. Didn't work, what. The said was that. If you want to really recover the clusters that is the T else, you. Do SVD, first then you run k-means. So. Even today, algorithm. Like k-means, we still we. Can only see then it works well and things are very very well separated and. As. Things are closer and closer I need to do something that's like you do a speedy and, like that and. And. People, who are looking for smart exciting, crawlers smart. Guys new exciting problems, ask, the question some of they do not know K. That's. We, don't, know so, we make some comments at the end of this class. About. This so, next thing as I said we will look at matrix, factorization. Now. So. Again so one of the matrix which Malik alluded. To for example the. Rating. Matrix right so, D, can be the number of items to be rated and n can be number of customers. So. Now if you, can ask the web company guy he will say DS like million, products is quite common how. Many customers a, billion. Customers. This. Is reality I mean this one when making this program. So. So now how do you Packer eyes these things right so a equal to BC is this factorization, now, the interesting part is. The. Interesting part is what people have found is that. Most. Of the entries away all right let's say all increase away are positive, not negative they can be 0 or not negative now. If we say B and C also has to have the same on negativity. Then. This problem is interesting, from a practical standpoint but. Also very hard. Number. Two why non negativity know, that part is not very clear but Li and CEO in 1999. In a neighboring the interesting feature paper showed that. Maybe, it compresses, philosophy, that maybe these. B. Matrix then will. Help us decompose. Into. Sub. Parts so we'll see an example but, so. This has found, applications, in text mining hyperspectral, imaging and, outputs, of other things so. The interesting thing is this. Problem, is. Again. Intractable however. There are fantastic.

Algorithms. Which works pretty well but. Cannot, guarantee you explicitly, that did, I did ref and write B and write C, so. Just, to be clear over, the example which I said right so, now if. We run nmf. Or non-negative matrix factorization on, a bunch of images so, I do not know if you can see you can see this is like the eyes you. Know there is something like the lips. This. There's something about the you, know eyelashes. And, all that right so is indeed finding, some decomposition. Which. Is like meaningful, right that is the reason why I never has become so popular and, so this is the or B matrix and this is our C matrix which is the multipliers and you see this if an, original, face can, be represented. By the C matrix where, B is my basis and it, can recover for me the. You. Know of, image, which is pretty close to the original image so, nmf that's why it's called so. Yeah is also. Contributing. To english language there's. A new word is fine now it's called feature Iser. So. I have data and now B is your features and C. Are the coefficients, right. So this is the future Iser so, as you see this is very interesting here now, the question, is. But. When does it give the correct decomposition. Now. There's some first good news is. Np-hard. The, simple similar problems right however. However. So. Let's say so see, it's. Very useful to say is np-hard I mean, because. Now. Can, you make some assumptions on which which is realistic yet. You. Know can you say okay in this case is not NPR if you have to assume some more things so, people said this is nice idea but separable, ax t let's, say that if B so. B matrix right you know this is like identity matrix here in our words all, right and rest. Is this can be anything so, that is there. Exists at least one. Word. If. These are the words and any of the documents let us say or B's, or let's say these number of items as we said but if there's one item you. Know which. Which. Does not occur which. Does not have any, let. Us say the entry of the, B matrix is 0 for, in the for. The other in other features right, or, rather let's call them factors, for that matter see let each column be a factor and say that one. Item only there, are some special are items which, appear, each, of them appear only one of the factors right so, then have a something like be equal to I R by B 0. Under. That condition this was separable ax t. It. Is possible, to find the practicum position now. Even. Algorithms. For this was also not known this, was in 2003, right. Only. After almost 10 years. This. This. Nice work came up where, this algorithms. Based on simply shell factorization. Was. Proposed another, point we can say here, is a provable scheme of finding, an animal right now, that was pretty slow that was worked. Upon and the new and, interesting. Algorithm. Was proposed called hot topics, but. Even then it needs to solve an LP you know n square variables. What. Do you mean by that. Long. Story short it will not work for your, billion. Cross million things it's. Too slow, doesn't. Me said you want to say right is, not today. So. Now here, so, what we have been thinking about this problem and said, ha. So, let's talk about the, other thing we observed that. Existing. Algorithms, like, we just said right though the proven and all that but. Another, problem with this they're not at all robust that is you just add a little bit of mice a whole, thing go through so. Now the idea is that can you develop, algorithms which. Should have they. Should be robust this should be scalable they should be proven. Now. They, here I mean, will not worry to death but. We could we have an SPD, best nmf algorithm, which. Which. Can tolerate little bit a heavier noise so, by this annoys they can tolerate that is each, column. Of n in. Is noise matrix, right. It's, like Epsilon right in, l1 norm some, norm, whereas. We can tolerate. On. An average it should be epsilon that is some of the ends can have not general than epsilon that's the point that, we call heavy noise right. So, this is one of the things which we were, doing. Now. Let's quickly. Jump. Change, tack. Okay. So I think, my time is up so before the so hence, in the next two minutes what I want to do is quickly, tell you about a, more. Interesting a. Different. Area, which. Is often known as, latent.

Variable Models so. Here. Written. Variable models often say that there are observations, and there are hidden, variables right and often written like this, now. This is a very popular model, your. K-means algorithm can also be incorporated in this hmm. Is can be incorporated in this you. Know your belief networks can be invoked or corporate in this. Now. Now. To run such algorithms, you need to do eeehm each, person maximals in this one one, broad paradigm. However. Now. You say that if I show you enough examples and. I can you recover the parameters, and that's. NPR. Of. Course, there are patient, ideas and MCMC methods for that do. The work well but they lack, these. Guarantees so, what we have done is. We. So we can exactly you can use this one, particular, instance and I will take just two minutes and, then do, that so. One particular example of such a little variable model is, what, we call as topic models. So. Now now, think, like this that, we, need to understand what a document is all about, now. If I say a document, contains the following five words run, inning, hit season, game what, is the document about brick. So. If I say document contains about computer, software system, Microsoft company, the. Knowledge that was not to publish as Microsoft name it came already set so, of course you talk about computers right so patient drug doctor cancer medical so, you see you, didn't ask I see the question was to what document is all about you didn't ask for the verb you didn't ask for the now you didn't ask for you know who wrote it. This. Is five words told you something, so. The question is that these are often called topics, right. So, now this has been little sport and that's computers, etc right. Now. The the the interesting part is. Fine. So, just quickly good yeah. So. So. It is very difficult to. Discover. This topics even this are NPR, so. What we have made is of course building on the work of others especially, a super ability assumption we, have been able to. We. Have been able to come up with an SVD based algorithm. The. Amount of time. To. Say this. I. Just, put the last slide show, this right. Now. Now one, of the questions was that so, many, of you have used this probably this algorithm called led. A little. Equation now. It doesn't guarantee you or. You can ask how many samples do I need to learn nd this, has been asked by industry beautiful ok. Another. Point is I don't. Male deer but I set my kid to be 100 and they all think of garbage, break you'll 50 work well so.

How Do you solve is all this problems right now we, have come up with an algorithm. Which. Just requires SVD which means it can scale well and has, and, it can recover this if ms or the topics can come you know mmm hats are pretty close to that right, now currently, we are in work with Microsoft we're trying to scale, to like a billion tokens and. Why is it possible the point is that. SVD. Is a very interesting algorithm. Which, has been worked upon and can easily scalable. Now. I just so. So. You just made a comment about finding. Sorry. Finding. That, K thing right here. Also in, both the problems we present nmf and our, topic. Models I still need to know the number of topics, now. If you want to ask. What. If I do not know the number of topics that's the reality case, right. Now that is sometimes called a non-negative rank. Now. If non-negative rank was equal to rank then things would have been easy, but. It would test that in to test the tronic when is undergirded i equal to rank that's. Also np-hard, you. See life is also. Also. Non-negative, rank, has connections, to you. Know a very, very deep. Promotion. In, communication. Complexity. Non-deterministic communication, complexity. An. Important, quantity incremental optimization, so, the so point being I don't see this kind of problems get resolved, very soon. So. Long story short in summary. Existing. Foundations on service learning algorithm. Lag, guarantees, but. We need them if we, want AI systems, to be you, know more useful in practice we will ask demand, these questions, but. If you answer that we. Are hitting against very hard problems in theoretical computer science, so. So. I think most. Probably learn musk is wrong 11, next 40 years yeah, I will not going to take over I'll stop stop. Here. I'm. Not very hopeful note thank you Chiranjeevi, and I will take prolly one or two quick questions yeah mic, question. Here first. Yeah. Very. Nice truck thank, you one. Quick question so the idea of being robust to the noise apparently. That the problem is there across the machine learning even deeper deep mode I do suffer so. Is, there any general approach that we can take I mean from an optimization, perspective or from a training perspective how, can address this problem okay. So there are two two, things. So. At this point. Clearly. We have to separate out service. Versus unsupervised, okay. Now. Per supervised, for. Example, there is an area of work. Which. Touches upon robust optimization, for, example if you can look up my sometime old papers, also has done this for example it has missing values so, what can I do so there are techniques.

For That how, the same techniques, will not apply for the problems in unsupervised, life or maybe, you do not know enough to unify, these formalisms, so. Each so, now we have pruned up to our if you are some general questions probably answer is will say, no I do not know how many way but, if you say for. Nmf you know I see noise in a what, can you do okay, there may be some hope like that but, the same procedure will not have may not apply to let us set up in models, all right, changes right always we do not know enough. Right. Okay one. More question here yeah. Thanks, that was an interesting talk you. Talked about the. Non-. Rank of being, np-hard. I was. Wondering if there is any way like in practice we need two non- factorizations, and then, what's. The general guideline, to tune, this rank parameter because we. Work with tensors in in which case the rank itself isn't np-hard. Exactly. That's something so, now at, this point there are no there. Are no guidelines for, so you except to say try. Out everything so. So. One option is that okay, for sure we can do the following thing. Non-negative. Rank is np-hard, but. You can compute rank and. Rank. For sure the lower bound or non-negative n so. Whatever I should choose we should be more than the rank is, the first thing, right, so. And what. I am trying to say is that beyond what I've written that is no, further insight, at this point. Now. There is another very exciting line, of work which I didn't get time to cover it's called Bayesian nonparametric, s--, right. So they're using. Suitable priors, one. Can maybe you hope to get a better, estimate again, but it does not give you a guarantee, that you. Know does it have this it is again kind is a very very loose most, of them the nonparametric models are very hard to take, a lot of time. Under. Tune the parameters there. Is yeah, and the reason is reason, is that you. Know the guarantees are so weak you know we would, like if, you ask which one should ask question like this how many samples will I need would, I recover what is my error so, unfortunately, the mechanism, is, cannot. Deliver you these, results because, their. Main thing is guided by in, expectation. This is what you get which. Is often not pretty useful.

*2018-03-24 14:46*

__Comments:__

*2018-03-26 16:27*

pretty pointless out of context

*2018-03-27 07:32*

Some unsupervised learning ideas: https://discourse.numenta.org/t/similarity-alignment-a-missing-link-between-structure-function-and-algorithms/3683/20

2018-03-25 07:24Is there any good algorithms to estimate non-negative rank of a matrix?