NW-NLP 2018: Compositional Language Modeling for Icon-Based Augmentative & Alternative Communication
Hi everyone. Today I'm going to present compositional language modeling for icon-based augmentative and alternative communication. The child you see in the photo has cerebral palsy. He is using an augmentative and alternative communication system, an AAC system, to communicate with his surroundings and, basically, to overcome his speech-language impairments. There are two types of AAC systems, either text-based or icon-based; this child, and the focus of today's talk, is on an icon-based AAC system. The way he uses it is by choosing one icon after another on the way to composing a message.

For icon-based AAC systems there are mainly two approaches. Either you work with a hierarchical AAC system, in which a user navigates a tree on the way to finding their icon, or you use a color-coded board, in which groups of icons represent grammatical functionality: say, a group of red icons represents verbs, or the green ones represent prepositions. The problem, however, is that when we scale up the vocabulary size of these icon sets, these approaches fall short. In addition, depending on the selection modality, we might accumulate many errors, because the user may be required to make a cascade of choices until they reach their target. Therefore, naturally, we thought: why not create a language-model-based icon AAC system? One thing I will say is that there is not much literature on icons and language models, to the best of our knowledge, but we did rely on an earlier work that tried to assign semantics to icons. One additional thing I'll mention, which I forgot, sorry: what you'll see here is a unique problem. While we are all well familiar with many types of language models, we'll introduce a unique problem that most of you probably haven't heard of by now.

So, back to our language models of icons. Together with our clinical team we decided to work on the SymbolStix icon dataset. It is used by communities who are in need of icon-based communication, not only individuals with cerebral palsy but also people who may experience traumatic brain injury, ALS, or locked-in syndrome. It is human-created, containing 34,000 icons, 5,000 of which are of single words; the rest are phrases, which we plan on exploring in the future. We ended up using 3,500 unique single words. Let's see an example of an icon. This is an icon, and it is associated with metadata, different fields, among which we can see its name, word type, and synonyms. Notice that among its synonyms it also has phrases. But more than that, it also has a synonym suggesting that maybe this icon represents a slightly broader concept than what you and I would think of as synonyms for the term "agree." One last thing: there is no corpus available for this icon set.

So then the question becomes: how do you create language models for a corpus-less symbol set? What we know for sure we don't have is an actual corpus. But we do have this metadata, we know about pre-existing textual corpora, and we have one more tool, word-embedding representations. So maybe we can utilize and incorporate these tools to figure out how to simulate an icon language.

Let's start with icon representation. Take the example of the same icon and retrieve its name and synonyms. We go to pre-trained embeddings, take all the terms and tokens representing this icon, compose them together, and now we have represented our icon. Also notice that this captures the broader concept of the icon, as opposed to what we know about "agree," so it represents this icon better. And so on and so forth: that's how we make our icon embeddings. That is one path. Now we want to incorporate these embeddings into a textual dataset that will provide us sequences. Say we have a sentence such as "you agree to solve problems." We look into our icon-embedding dictionary, and one token at a time we can simulate our icon language from this textual dataset. Now we have embeddings in sequences, and we are ready to train language models on them.
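As a rough illustration of the pipeline just described, here is a minimal sketch: an icon vector composed by averaging the pre-trained vectors of the icon's name and synonyms, and a textual sentence then mapped token by token into a sequence of icon embeddings. The vectors, icon names, and the word-to-icon mapping below are toy assumptions for illustration, not the actual SymbolStix data.

```python
import numpy as np

# Toy pre-trained word embeddings (assumption: in the talk these come from
# GloVe or context2vec; the 3-d vectors below are made up).
pretrained = {
    "agree":   np.array([0.9, 0.1, 0.0]),
    "accept":  np.array([0.8, 0.2, 0.1]),
    "approve": np.array([0.7, 0.3, 0.0]),
    "you":     np.array([0.1, 0.9, 0.2]),
}

def icon_embedding(name, synonyms):
    """Compose an icon vector by averaging the vectors of its name and
    synonyms (one simple way to 'compose them together')."""
    tokens = [t for t in [name] + synonyms if t in pretrained]
    return np.mean([pretrained[t] for t in tokens], axis=0)

# Build a small icon dictionary from the metadata fields.
icons = {
    "AGREE": icon_embedding("agree", ["accept", "approve"]),
    "YOU":   icon_embedding("you", []),
}

def simulate_icon_sequence(sentence, icons, word_to_icon):
    """Map a textual sentence, token by token, into icon embeddings,
    simulating an icon-language sequence from ordinary text."""
    return [icons[word_to_icon[w]] for w in sentence.split() if w in word_to_icon]

seq = simulate_icon_sequence("you agree", icons, {"you": "YOU", "agree": "AGREE"})
```

The AGREE vector ends up between "agree" and its broader synonyms, which is the point: the composed icon captures a slightly wider concept than the single word.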
We employed a simple, standard architecture for language modeling and trained it in a five-fold cross-validation manner, and we looked at three different measurements to evaluate our language models. One was mean reciprocal rank: for each prediction the model produced, we looked at the rank of the target. Another metric was accuracy at one, meaning: look at the model's first guess and see whether the target was in it or not. Another was accuracy at ten, looking at the first ten guesses and checking whether the target was among them.

So now it's time to evaluate our choices throughout the process. Recall what the process was: we had pre-trained embeddings generating icon embeddings, both of which feed into the textual dataset, and at the bottom the text is translated into embeddings. So we have an embedding dataset, and we train language models on it. One experiment we conducted was basically playing with the pre-trained embeddings. We used GloVe, with a 400,000-entry vocabulary trained on English Wikipedia and Gigaword, and compared it to context2vec, which had about 160,000 entries and was trained on British English web-based text. For the textual dataset we used SUBTLEX-US, which was the best we could do: it contains six million sentences and was a good proxy for AAC-type conversation, though not the best one. We actually ran another experiment with a more AAC-oriented corpus.
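The three evaluation measures just mentioned can be sketched as follows; the candidate ranking below is a made-up example, not actual model output.

```python
def reciprocal_rank(ranked_predictions, target):
    """1 / (1-based rank of the target in the model's ranked guesses)."""
    return 1.0 / (ranked_predictions.index(target) + 1)

def accuracy_at_k(ranked_predictions, target, k):
    """1.0 if the target appears among the top-k guesses, else 0.0."""
    return 1.0 if target in ranked_predictions[:k] else 0.0

# One prediction step: the model ranks candidate next icons; target is "AGREE".
ranked = ["YOU", "AGREE", "SOLVE", "PROBLEM"]
mrr_contrib = reciprocal_rank(ranked, "AGREE")  # target is ranked 2nd
acc1 = accuracy_at_k(ranked, "AGREE", 1)        # not the first guess
acc10 = accuracy_at_k(ranked, "AGREE", 10)      # within the first ten guesses
```

Averaging these per-prediction values over a held-out fold gives MRR, accuracy@1, and accuracy@10.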
That experiment had its own results, which stand on their own and which I can't elaborate on now because we don't have time. Finally, we also wanted to play with another parameter: to see whether we could incorporate just the icon embeddings, without the pre-trained ones, into the textual corpus and learn language models from that. So we compared this "pure" condition to a non-pure one containing both. (I'm glad your imagination is working with me.)

As far as results go, we basically haven't found a meaningful difference between the two types of pre-trained embeddings, but we did find a clear degradation once we use just the pure icons. We suspect this is because using just the icons did not give us good coverage of the dataset. I also wanted to provide some examples of these icons; maybe I'll have a chance to show them in the last minutes I have here. Maybe not.

I'll get to the conclusions and say that the biggest limitation of this approach is that we are simulating an icon language as best we can, but still from written text, which imposes its own constraints. As for future steps, we plan on evaluating our language models on real users, at first healthy ones and perhaps later on those in need. We also plan to have representations for multi-sense icons and multi-phrase icons. We would like to thank our clinical team, our Northeastern team, as well as our funders. That's it.

Q: While we're waiting to get the slides back, any questions? Yes: is there currently an effort underway to collect an icon-based corpus?

A: We're trying to push for it, because of this limitation: we do know that there's a good chance that a real icon language would not be subject to the constraints imposed by a written-text corpus.
And we're pushing for it, but, yes, there is no such corpus that I know of.

Q: Could you say more about how you got the AAC-appropriate text corpus? What was that from?

A: The SymbolStix dataset, which I may go back to, comes from a commercial company; together with our clinical team we have a license, a contract, and under the agreement we're allowed to use their icon set.

Q: I thought you said you had a text corpus of what people would say in an AAC situation. Where did that come from?

A: Thanks for clarifying. There is actual research conducted by Vertanen, and I can also give you the exact paper, in which he gathered 6,000 sentences from Turkers, who were asked, under constraints, to produce AAC-oriented sentences and also to evaluate their peers' sentences. But it's only six thousand, and that's not enough, so we had to pad it with an additional corpus to get more meaningful results to train on.

Q: Has there been any research done on comprehension of icon-based communication?

A: Well, now I'm talking outside the context of electronic usage, but we do know about clinicians who have these communication boards that they point to, trying to teach a child with disabilities how to communicate. So in that regard it is clinically deployed; it is the practice.

Q: How does that compare to textual comprehension versus icon-based?

A: That's a good point, and here is something I should have said: the people for whom icons would be useful would perhaps be illiterate users, or children who are in their developmental stage. But perhaps if you provide a good platform that gives good, fast predictions and does not require going through a long cascade of selections, it might be even more useful than text.

Q: I was just wondering how polysemous the icons are compared to their corresponding words. Do you try to represent a more specific sense?
Q (cont.): ...and metaphor. But the other part is: with context2vec versus GloVe embeddings, did you notice any difference in your performance?

A: If I got the first part correctly, you're asking about multi-sense disambiguation, or representation, of icons. We do have in the set different icons representing the same term, where, say, one is a verb and one is a noun. One way we could incorporate that into our representations is via an earlier work, I think by Yoav Goldberg, on syntax-aware word embeddings, which takes the part of speech into account, so you end up with two different representations. On the second part: right, I compared context2vec to GloVe, and in that experiment we did not observe meaningful changes.

[Session chatter: the A/V problems are resolved; the screens had gone down, and one room's slides were showing in the other. "You handled that very well; most people wouldn't have known what to do, and you just kept going."]

Next up we have Ajay Nagesh with "Keep Your Bearings: Lightly-Supervised Information Extraction with Ladder Networks That Avoids Semantic Drift."

Hi everybody, I'm Ajay, from the University of Arizona, and this is some work that we did on semi-supervised learning using ladder networks. Although we have come a long way in machine learning and performance has hit the roof, there is still the problem of getting labeled data. As this meme is showing, that's the issue; that's the elephant in the room.
For any kind of machine-learning pipeline, the Achilles heel is getting labeled data. It's hard to get, it's costly, and it requires a lot of manual effort. The whole branch of semi-supervised learning asks whether we can augment labeled data with lots of unlabeled observations, which are available in plenty, and be able to generalize from that.

How do we do that? There are lots of techniques, like self-training, co-training, and label propagation. Predominantly, the state of the art in semi-supervised learning is what is known as bootstrapping, which is an iterative algorithm. Suppose our task is to learn names of entities; we want some kind of lexicon induction. We start with a small curated set of seeds. For instance, if we are learning persons, organizations, locations, and so on, we start with some cities like London, or some famous persons like Bill Clinton. Then we go to the data and find patterns which occur with these entities: say, "___ is the capital of ___" is a good indicator for a location entity. Then we go back and get some new candidates; in this case it's New Delhi, which is not in the seeds. We add it, and we keep growing the lexicon in an iterative manner.

One of the fundamental problems of bootstrapping is semantic drift; we'll see an example here. Suppose we are learning female names: you start with Susan, Rose, and so on. Then you go to your data and find some patterns. You find good patterns which indicate women's names, but you also find some patterns which might indicate flower names. If the algorithm picks up those patterns, then we add entities which are actually not women's names but flowers, and so we are drifting from the original semantics of what we intended to capture. This is semantic drift.

One of our fundamental observations, and this is our hypothesis, is that the drift is caused by the iterative nature of bootstrapping. A lot of related work shows that it's unavoidable; we have to somehow mitigate it. So how do we mitigate it? There are ways to do so within the same framework, but how about changing the framework? How about not doing this iteratively, but doing something in a one-shot manner? There is some recent work which does semi-supervised learning not in an iterative manner but in a one-shot manner.
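For concreteness, here is a minimal sketch of the bootstrapping loop just described, with a toy corpus of (entity, pattern) pairs; the scoring is deliberately naive, which is exactly what lets the "Rose"-the-flower drift happen.

```python
# Toy corpus of (entity, pattern) co-occurrences; real systems extract these
# from large text collections. All examples here are made up.
corpus = [
    ("London", "X is the capital of Y"),
    ("New Delhi", "X is the capital of Y"),
    ("Rose", "a bouquet of X"),
    ("Susan", "X said hello"),
    ("Rose", "X said hello"),
]

def bootstrap(seeds, corpus, iterations=2):
    """Iterative lexicon induction: seeds -> patterns -> new entities -> ..."""
    lexicon = set(seeds)
    for _ in range(iterations):
        # 1. Find patterns that co-occur with current lexicon entries.
        patterns = {p for e, p in corpus if e in lexicon}
        # 2. Add every entity matching those patterns. This unchecked step
        #    is where semantic drift creeps in: "Rose" matches a pattern
        #    shared with women's names, so the flower sense sneaks into
        #    the female-name lexicon.
        lexicon |= {e for e, p in corpus if p in patterns}
    return lexicon

female_names = bootstrap({"Susan"}, corpus)
```

After two iterations the lexicon has absorbed "Rose" via the shared "X said hello" pattern, illustrating how each iteration compounds the previous one's mistakes.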
Much of this is in the image-processing community, where it is termed one-shot learning. In this work we focused on one such technique, called ladder networks, which performs remarkably well on semi-supervised image-classification tasks, beating all the baselines.

What is a ladder network? It's basically a deep denoising autoencoder. Just like a regular autoencoder, it has an encoder-decoder pair: you start with some input and try to predict it back. The "denoising" part is that we add some noise, because we want to learn abstractions despite noise, so that we learn good features. One of the key differences between a regular denoising autoencoder and the ladder network is the skip connections: a plain denoising autoencoder just regenerates the input, but with these skip connections you also reconstruct the intermediate layers. As a result, the network becomes modular, just like hierarchical latent-variable models; that's one of the key differentiating factors. Along with that, there is another aspect: you have a second, clean encoder, which does not have noise. This is used predominantly for those examples in your data which have labels, and you use a supervised cost to learn and backprop through the network; you could use any supervised cost here. The other differentiating aspect is that the cost function is a combination of that supervised cost and the reconstruction cost: the difference between the reconstruction and the clean encoder's outputs. These are the two things which give us a framework to use both labeled and unlabeled data.

We look at the task of named-entity classification, which is similar to the lexicon induction I was talking about: we are given some text, the entities are highlighted (we know their demarcations), and we want to find the class of each entity. We prepare the input to the ladder network as follows: we take the n-gram patterns around the entity, use pre-trained embeddings to populate the tokens of the patterns and the words, have an averaging layer which computes a representation for the entity and for the context (a bunch of n-grams around the entity), and concatenate them. This is fed to the ladder-network framework, which here is a simple two-layer feed-forward network. The noise we add is standard Gaussian noise, just like in the ladder networks used for image processing.

Coming to our experiments: we use two datasets, CoNLL and OntoNotes. Our baselines are a state-of-the-art bootstrapping system and the well-known label-propagation algorithm. We start with some initial seeds, chosen randomly, a very small sliver of the training set, with equal representation across categories, and we use the same seeds for all the baselines. Here are the results: what we have is precision versus throughput, where at each epoch the baselines add some entities to the pool, and this is what the average precision looks like over time. And this is the ladder network's performance: we see a phenomenal improvement, close to 62 to 200 percent, on the OntoNotes dataset. There might be a number of reasons why this works well. One might be that we are using multiple deep layers; another is the noise, which might act as additional regularization. But one thing we did notice is that, because of the non-iterative, one-shot nature, we can say it is mitigating semantic drift in some sense.
We did some more experiments where we increased the amount of supervision, and there are no surprises here: the more the supervision, the better the results. Some concluding remarks and future directions: we want to apply this to other tasks like relation extraction, where labeled data is hard to come by, and try different encoders like CNNs, RNNs, and so on. We also want to focus on interpretability. The reason we used n-grams rather than just averaging the words is that we wanted to be able to trace back which set of patterns led to a decision, which is sort of like explaining the model; we want to dump the rules and so on. That's what I have.

Q: Have you thought about cases where semantic drift might not be such a bad thing? For example, I can think of scenarios where you might be perpetuating biases of certain kinds if you don't incorporate semantic drift; if you see female names, for example, always with "secretaries" or whatever, that might be a bit of a problem, right?

A: The underlying assumption is that the classes you're trying to learn are mutually exclusive, or have very little overlap. Yes, we might be perpetuating biases. Sometimes a bias may be good, in the sense that, for instance, if the dataset had a lot of person labels then you would want to predict more person labels, whereas the iterative algorithms were not capturing this kind of bias. But bias is bad as well, and I don't have an answer to that question.

Next up: improving infrequent discourse relations using data programming. Hi, I'm from the University of British Columbia, and today I'm going to present our work, titled here.
Discourse parsing is an important task in natural language processing; we focus on parsing text into its rhetorical structure. Here we have an example of what the rhetorical structure of these pieces of text looks like; ideally this would be the output of a discourse parser run on the text. Because rhetorical structure encodes a lot of information about how the text is organized, it is extremely useful for a wide range of downstream NLP tasks like summarization and sentiment analysis. Parsing can be done in two steps, although it's not necessarily required to be: first you create the tree, and then you assign the relations on the tree nodes; here the relations are Purpose and List. In our work we mostly focus on the second task.

The performance of existing methods has been, overall, very good for this second task: results from the state-of-the-art discourse parser show 59.7 micro-F1 over all 18 relations. However, this metric hides a lot of detail. If we look at the F1 per relation, the result is less exciting: we have some relations that work very well overall, but we also have those that do not work nearly as well, like Evaluation or Topic-Change. We tried to evaluate what causes these relations to work poorly. Not surprisingly, we see that relations with less training data usually perform badly, and those with more data perform better. Driven by this observation, we propose to use semi-supervised methods to add more training data for infrequent relations, to boost their performance.

To do so, we apply a framework called data programming, introduced by a group of researchers from Stanford in 2016. In this framework one uses domain knowledge to curate heuristic functions that label part of the training data, and these heuristic functions are then combined, much as in an ensembling method. The way they are combined is by training a graphical model, like the one shown on the right. This graphical model learns an accuracy for each heuristic by comparing them against each other, and in situations where a heuristic fails to produce a label, that label is simply treated as a hidden value in the graphical model. The true label, which is unobserved, is also treated as a hidden value. Once the graphical model is trained, for every unlabeled instance, given the outputs of all the heuristic functions, one can infer a probability distribution over its possible labels.

However, this framework cannot be directly applied to our problem, because it is practically impossible to hand-write heuristic functions for discourse relations. So we use machine-learned models to replace the heuristic functions. This is how we create these labeling functions: we randomly sample half of the labeled data, four times. This gives four different, possibly overlapping, datasets, and on each of them we train a one-hidden-layer neural classifier, similar to the method of bagging. Then we pass each unlabeled data entry through each of these classifiers and let it output its label together with a confidence score. The confidence score, which is the softmax output from the last layer of the neural network, is used to filter out labels with relatively low confidence. We tried two variants of filtering. One is to set a uniform boundary for all labels: as long as the confidence score is greater than a predefined boundary, the label is kept; otherwise it is discarded.
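The bagging-plus-confidence-filtering idea can be sketched as follows. The toy "classifiers" below are lookup tables with invented confidence scores, standing in for the one-hidden-layer neural networks and their softmax outputs; the relation labels are examples only.

```python
import random
from collections import Counter

random.seed(0)  # deterministic toy run

def train_toy_classifier(labeled_half):
    """Stand-in for training a one-hidden-layer net on half the labeled data:
    memorize the half, fall back to its majority label with low confidence."""
    majority = Counter(lbl for _, lbl in labeled_half).most_common(1)[0][0]
    memory = dict(labeled_half)
    def classify(x):
        label = memory.get(x, majority)
        confidence = 0.9 if x in memory else 0.5  # invented softmax scores
        return label, confidence
    return classify

labeled = [("because", "Cause"), ("for example", "Elaboration"),
           ("in contrast", "Contrast"), ("so that", "Purpose")]

# Four random half-samples of the labeled data, as in the talk (bagging).
classifiers = [train_toy_classifier(random.sample(labeled, len(labeled) // 2))
               for _ in range(4)]

def label_unlabeled(x, boundary=0.8):
    """Uniform-boundary filtering: keep only outputs whose confidence
    clears the predefined boundary."""
    return [lbl for clf in classifiers
            for lbl, conf in [clf(x)] if conf >= boundary]

votes = label_unlabeled("because")
```

The surviving votes are what the data-programming graphical model would then reconcile into a single probabilistic label per instance.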
We also tried another variant, which we call a dynamic boundary: for each label, the boundary is dynamically selected so that the distribution of the selected labels matches the distribution of labels already in the training set. These two choices generate different results, as we will see later. After we filter the labels, we can directly invoke the data-programming framework and train a graphical model like this one; again, white denotes hidden values and dark denotes observed values. After that, we train our final model, which utilizes both the supervised and the unsupervised datasets, with a loss function of this form, combining a supervised part and an unsupervised part; both use cross-entropy. The supervised part uses the true label as its supervision, while the unsupervised part uses the label distribution coming from data programming as its supervision.

For the experiments, we work with the RST Discourse Treebank dataset, also called the RST-DT. For the unlabeled dataset we use the New York Times Annotated Corpus, which consists of news documents like the RST-DT. In the RST-DT, the top ten most frequent relations form almost 90% of the training data, and the remaining eight relations take up only 10% of the dataset. Our aim is to improve the performance of the lower eight without negatively affecting the top ten.

Some additional experimental settings: the data-programming graphical model is trained using stochastic gradient descent on the maximum-likelihood estimate of the observations.
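The combined loss just described can be sketched as two cross-entropy terms, one against a hard true label and one against the probabilistic label distribution inferred by data programming. The probability vectors below are made-up examples.

```python
import math

def supervised_ce(model_probs, true_label):
    """Cross-entropy against the true label, for supervised examples."""
    return -math.log(model_probs[true_label])

def unsupervised_ce(model_probs, dp_distribution):
    """Cross-entropy against the probabilistic label distribution that
    data programming inferred, for unsupervised examples."""
    return -sum(p * math.log(q) for p, q in zip(dp_distribution, model_probs))

# One supervised example (true label is class 0) plus one unsupervised
# example whose data-programming distribution leans toward class 1.
loss = (supervised_ce([0.6, 0.3, 0.1], 0)
        + unsupervised_ce([0.1, 0.8, 0.1], [0.2, 0.7, 0.1]))
```

When the data-programming distribution collapses to a single label, the unsupervised term reduces exactly to the ordinary supervised cross-entropy.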
For the classifiers we use as labeling functions, and also for the final model, we always use a simple one-hidden-layer feed-forward neural network. We tried two variants of it. One is a standard one that does not have dropout, with parameters loosely tuned by cross-validation, just to make sure the loss no longer changes. The other has dropout enabled, with parameters tuned more carefully by cross-validation to reach the highest micro-average F1. For the classifier features we use the concatenation of two feature sets: first, the human-engineered features from the previous state of the art, and second, the word2vec embeddings of the boundary words of each discourse unit.

Here are the results. We evaluate on four metrics, all F1 scores. The first is the micro-F1 score, which is the most common metric; the rest are macro-F1 scores, which are basically the average of the per-relation F1 scores without taking their sizes into account. Look at the first two columns, which show the performance of the uniform boundary for neural network one, with and without data programming: for this variant, the one with data programming significantly outperforms the one without. This is not the case for neural network two, where we actually see decreasing performance. However, the result changes if we change the filtering. If we use the dynamic boundary instead, for neural network one we still have an improvement in performance, although slightly less than before, and for neural network two we improve three out of the four scores; only on micro-average F1 are we unable to improve.

To check in detail that our method actually helps infrequent relations, we examine the performance on all 18 relations individually. Here I'm showing the ten most frequent relations, ordered by their frequency in the training set, given in the second column; the third column is the performance without data programming and the fourth column with it. You can see that we reach our original goal: we usually do not have a significant decrease in performance, and we actually get some improvements, which was unexpected. For the infrequent relations, we get some huge improvements on those that originally worked extremely badly, except for Topic-Comment, and we mostly do not have negative results, except for Summary; we're still investigating why Summary is so different. So our method does help improve the performance of infrequent relations; however, it still does not beat the state of the art. We propose some future directions: utilize other strategies like meta-learning, develop better few-shot classifiers, and use a deep-learning-based method instead.

[Session chatter: Is Shobhit here? We need Shobhit.] Next up we have Shobhit Hathi presenting.
Community member retrieval, on social media using textual, information. You. Can use that a few lateness. Hi. Everyone my, name is Ashe öbut and I'm presenting, our paper on community, members a table on social media using textual, information ah. So, just to introduce it with a scenario say, a user, selects, a handful of Twitter users and, wants, to find more people that belong in this community so, in, this example the user selected, a handful of NLP researchers, and would, want to find more people that tweet like these users the. Detection model, that we propose will only use textual, features from the user tweets, and. While most prior work uses social network connections for this type of task the, social graph is not always accessible and not everyone is well connected so that's why we're only using textual, features and, a. Major, challenge, with how we with. Modeling, for this task is that the, only labelled, training, data comes from the query itself so, the solution we propose is learning similarities. And differences. Between individual. People as a proxy task and hoping that this information carries. Over to the main task and we, do this using a person to be identification, task and. A large unlabeled. Collection of tweets and, once. We have the features from this. Proxy task we chain a logistic, regression, classifier on top of the on, top of these features, whenever. We are asked to search for members of a community. So. The main contribution. Was the proxy, tasks so let's just examine it real quick. We, use bag-of-words of, features across all the tweets of the users and we. Take. Two samples of a person P Street which, in this case would be eov Goldberg and one, sample from a person queue which in this case is Elon Musk and we. Define a user embedding you as the average of word, embeddings, weighted, by the log frequency, that. They appear and we learn an embedding, vector for each token of the vocabulary to minimize a margin. 
plus the distance between the two samples of person P's tweets, in this case Yoav's, minus the distance between the sample of person P's tweets and person Q's tweets. So, intuitively, what we're learning is: what does Yoav have in common with himself that separates him from Elon? Again, we hope that learning this representation encodes information that will be useful for picking out community members later on. So, the data that we collected: we gathered eighty thousand users from Twitter by randomly selecting a thousand trending Twitter topics in the US over some months last spring; we found users tweeting about these topics and collected their 2,000 most recent tweets.
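The proxy task just described, a log-frequency-weighted bag-of-words user embedding trained with a margin objective, can be sketched in plain Python. This is a toy sketch, not the authors' code: the function names, the squared Euclidean distance, and the margin value are all assumptions.

```python
import math

def user_embedding(tokens, emb):
    """Average of word embeddings, weighted by log frequency of each token.
    `emb` maps token -> vector; tokens missing from `emb` are skipped."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    dim = len(next(iter(emb.values())))
    vec = [0.0] * dim
    total = 0.0
    for t, c in counts.items():
        if t not in emb:
            continue
        w = 1.0 + math.log(c)  # log-frequency weight (assumed form)
        total += w
        for i, x in enumerate(emb[t]):
            vec[i] += w * x
    return [x / total for x in vec] if total else vec

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def margin_loss(anchor, same_person, other_person, margin=1.0):
    """Hinge objective: pull two tweet samples of the same person together,
    push a different person's sample at least `margin` farther away."""
    return max(0.0, margin + sq_dist(anchor, same_person)
                          - sq_dist(anchor, other_person))
```

In training, this loss would be backpropagated into the token embedding vectors; the sketch just shows the objective's shape with fixed toy vectors.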
We also hand-created sixteen communities, which were just groups of users that fit themes of interest to us and our friends and colleagues. The communities that we selected are displayed in the table here, along with the sizes of the communities. The vocabulary had 174 thousand unique types, including 49 thousand bigrams that were selected using pointwise mutual information, 36 thousand usernames, and 17,000 hashtags. The experiment was configured so that we took the 80,000 general-population members and split them into 36 thousand to learn embeddings for the proxy task, one thousand as negative samples to train the classifier, and 43,000 to evaluate the model. The way we performed the model evaluation was by holding a community member out, training the logistic regression classifier on the remaining community members and the thousand negative samples, and performing a ranking task where we scored the one held-out community member against all 43,000 negative samples. We performed this for every member of every community and averaged the results, and the metrics we used were area under the curve and mean reciprocal rank. So, we tested our re-ID strategy against word2vec and Latent Dirichlet Allocation baselines, which are commonly used for this type of task. What we found is that our re-ID features do much better than the baselines; you can see that the average rank really isn't close. We found that the idea works well even with small queries (as you saw in the previous table, some of the communities had as few as 11 members), and we found that when we took the pre-trained word2vec embeddings and then trained them with the re-ID task, the resulting embeddings performed the best. So, to examine exactly
what the proxy task was learning, we looked at the words whose embeddings changed the most between the word2vec embeddings and the embeddings after training with the re-ID task. What we found, after clustering these words, is that mentioning one of these words that changed a lot tells you a lot about your likes and dislikes, which is useful in separating you from some other random negative sample. We also wanted to find out what our classifier considers a prototypical tweet, that is, what words you can use to signal that you belong to a certain community. We found that the classifier and the embeddings are learning things that seem intuitively obvious: for example, if you talk about morphological priors for probabilistic neural word embeddings, you're probably an NLP researcher, whereas if you talk about airline economics, you're probably an economist. So, in conclusion, we propose a method for finding people that tweet similarly to a specified set of users. We introduce a person re-identification proxy task that allows us to create embeddings that perform well at this task compared to word2vec and LDA embeddings, even when the query set has very few users. These embeddings depend only on text but are complementary to social network features, so those can be incorporated as well. And while we use a bag-of-words model for simplicity,
this re-identification objective can be extended to other, more structured methods as well. Thank you. Any questions? Yeah. So, I actually had a quick one: the re-identification task is from computer vision, is that correct? Yes, the triplet-loss objective that we used in the re-ID task is adapted from a computer vision task. And that's for closed-circuit cameras, to make sure that a user who appears again, maybe with a partially occluded face, is the same as previously? Yes. Joel, are you here? Can you come forward? Go ahead. Any other questions? Could you use who people are connected to, who they retweet? No, we actually explicitly avoided using social network features, because the graph isn't always accessible for all users and all social media platforms, and some people might not be well connected. So if we want to find people that belong in a user interest group or community, we don't want to just find the people that are well connected to the query set, but also other people that have similar interests. So do you come across users who fall under multiple groups? Like if a person works 9 to 5 and tweets about NLP stuff, and then when he goes back home he tweets about what he is cooking, so there would be a diverse set of tweets; do you come across those instances? We did. The way we formulated the problem is that you can specify any query set, so it's up to the user to specify a query set that tweets about things they're interested in, and our model tries to find similarities between what the community is tweeting about. So, going back to the prototypical tweet: we had NLP researchers that tweet about NLP and then also tweet about other things, but the classifier learns that what's considered a prototypical
tweet for the community would be one that uses lots of words that are shared amongst the community members. Yeah, we're going to go ahead and take some time to catch up again, so let's thank our speaker again. And our final speaker for the morning is Joel Adams, on semantic similarity in conversational speech between children with and without ASD. All right, cool. So my name is Joel Adams, I'm from CSLU, and I'm talking about semantic similarity and differences in the speech of children with and without autism. Now, autism spectrum disorder is a range of conditions classified as a neurodevelopmental disorder. It's often associated with challenges with social communication, with early case studies discussing intense but specific interests or repetitive language, but we find that the speech of people with an ASD diagnosis is at least as heterogeneous as the speech of typically developing children and adults. Indeed, Dr. Stephen Shore, a clinician at Adelphi, says that if you know one person with autism, you know one person with autism; and in our study it seems to be that if you've spoken with one person with autism, you've spoken with one person with autism.
The work of our lab is specifically on quantifying the variability of speech in children with and without autism. So how can we quantify this in the language domain, and, specifically for this talk, how can we talk about what's different in what children speak about, with and without autism? Let's take a look at what that might look like. What are children talking about? Let's imagine that we have some really short transcripts of children; on this slide we have four of them. In the first and the second we have "I'm looking for my pet dinosaur, we just got out of jail" and "I just made the universe's most hottest hot lava hot sauce." These seem to be clearly about different subjects, and we would expect a reasonable semantic model to be able to show that they are different. But let's imagine we talk to two more children: the third one says "I tucked my dinosaur into bed" and the fourth says "that's a Lambeosaurus." The third seems to be similar to the first; it's probably about a pet dinosaur being tucked into bed. And "Lambeosaurus" is obviously about dinosaurs, but maybe we're talking about a different register with "Lambeosaurus" than with these imprisoned or sleepy dinosaurs. These are the sorts of similarities and differences that we're trying to tease out with this study. So how might we do that, and what kind of data would we use? What we use for data is transcripts of the Autism Diagnostic Observation Schedule, canonically the ADOS. These are 60-minute language samples from children with and without an ASD diagnosis. It's a semi-structured task where there are certain questions the examiner is expected to ask and there are
Certain activities, of the child typically, engages with but they're largely child driven but they are crucially, conversations. Than a dull examiner, in, our data, we're looking at children between the ages of five and eight and, we. Have 32. Children, who've been diagnosed with autism, and 30 to 38, to have three, to have not been diagnosed. So. Um. What. Did we used for semantic model we're going to talk about what these children are talking about oh and. To, be clear those those um those those, mini, transcripts. Were actually, subsets, of the, actual. A dose that we have there, are children, talking about dinosaurs and hot sauce. So. Support, this is started as a replication experience, parent, following Goodkind. At all and there they're a group in Northwestern University, who, found that using, word de Becque embeddings and a. Reference as, a reference, child selected, from the the pool of typically, developing children that. They could find, group differences in in. This word of X pace between what the children were talking about and we'll take a look at what that may have looked like in a minute so similarly. We used the, pre trained Google, News work to Beck embeddings these are word.
Embeddings For about three million words that result. In 300. Dimension. Vectors. We, then took for our, transcripts, we took each word in the transcript projected, that into this word de Becque vector space. Some. Though some of those vectors and then, normalize, the resulting, vector, into. Unit length and called that our semantic. Representation, with script and then we use coastline similarity in the space which, is a measure of the angle between them to measure their similarity. All. Right so, how. Similar are they let's imagine, that we took the very first child who incidentally, is typically developing, and. Compared. The. The, cosine similarity of all of the other children with, them and then split. Them up by group and this is what, that would look like this is actually of that subject, and and for people who are familiar with the north western paper this, is essentially, their their very first result and and we're able to replicate it but so. What you see is that by. And large and. This is the this is mean this is a dot here the. The average similarity, of typically, developing children to this. This. This reference transcript, are more, similar than and. Then, the, children. With ASD, diagnosis, so, oh and also you'll notice that there's there's more variability around, the ASD subject. As represented, by a 95, percent boot strap, a confidence, interval then. There is no that cheeky kids so but maybe this kid he. Admittedly like dinosaurs, maybe he like, that's an idiosyncratic, topic so what let's, choose a couple more kids, will. Grow out a couple more typically developing kids and see how whether, we still see the similarity. Um. So. The, one in the middle is the one you just saw and. And. You'll notice that there's, actually some variability in in in the results based on who you choose as your reference the, this. Here. The the groups are almost, indistinguishable. But. 
By and large we see that the typically developing kids are on average more, similar to the reference transcript in their transcripts and. The children see, diagnosis. So. Let's, look at all the kids, so. The. The ticks on the x-axis here these are all different. Reference, children all each, it, represents. The entire set of typically. Developing children, from our from our own data set and we see that what. Do we see I actually have a slides for that. So. Again. We see the language of typically developing children it seems more similar to any typically, developing transcript, then those, of those. Transcripts, of the children with autism spectrum. Spectrum. Disorder diagnosis, and. For. Every case we see more variants across the similarity, scores with children with ASD then then that's typically developing children, though. We do get these outliers um however despite the variability the. Pattern is pretty stable that the blue dots are above the red dots and that the if. The variance scores are the variance ranges, are larger, for that for the for, the red points so.
So what does this mean? Well, also following the Northwestern study, in most cases here the difference is enough to determine a group difference, so it does make sense that their choice of reference document from the typically developing children would result in something like this, at least in our data as well. What does it mean? It can mean a number of things. The first thing to remember is that these samples aren't created in a vacuum; they come from a conversation with the examiner, and it's possible that the gap we see between the two groups is just an artifact of that; maybe the typically developing children are more influenced by the examiner. It may also be that the children who have ASD talked about more idiosyncratic topics. But it could be other things as well: it may be that all typically developing children talk about dinosaurs and all children with ASD talk about something that's not dinosaurs, which seems exceptionally unlikely. Or it could be that children with ASD are just more linguistically creative, and that's what's being picked up here. This is what we're interested in for future work. The first thing we're doing: in all of the previous slides, all of the girls had been excised from our data, so we were just looking at children who are boys, largely because it's hard to find transcripts in bulk of girls with autism. We have an extension to one of our grants to collect transcripts of ADOSes with girls with an ASD diagnosis, and we're curious to see whether we see a similar pattern in the speech of girls. We're also, as we discussed on a previous slide, interested in studying how much the language and semantics, or topic space, of the child
either leads or follows that of the examiner, and also whether certain conversational contexts (keep in mind the ADOS has different types of activities) affect the amount of variability in children's speech. And this one is maybe obvious, but another future step is that representing our fairly
large transcripts by just summing all of the word2vec vectors and then reducing them down to unit length is a little coarse, so we'd like to improve the document semantic representation as well. In closing, I'd like to thank my co-authors: Alexandra Salem, who's here, and Drs. Alison Presmanes Hill, Steven Bedrick, and Jan van Santen. And if there are any questions, I'd love to take them. Hi, great talk. Have you tried comparing the semantic similarity of your data to a reference speaker on the autism spectrum? I'm sorry, what was that? Have you tried comparing the semantic similarity of your speakers to a reference speaker on the autism spectrum? Yeah, that's actually a really good question. I was thinking as I drove up here that I maybe should have included that as a bonus slide. But yeah, I've done the exact same plot with the ASD children as the reference documents instead of the typically developing ones, and there's a lot more swapping of the positions of the red and blue dots; in a lot of cases the differentiation between the two diagnosis groups is very challenging using the ASD children as the reference. So in your analysis, did you consider where on the spectrum the children were? That's a really good question. We did not for this. One of my advisors actually asked me specifically to consider that as a covariate here, like the ADOS score for example, but we have not looked into that yet. Time for one more. Do you have any ideas of how this might map onto clinical work? Is it going to help us maybe, pre-ADOS, collect, you know, who may be at risk for autism, or help us decide on those difficult cases? Yes. So that's
a really good question: is there clinical utility to this? I think that's always the first question that we ask ourselves when we do this research. So, for example, one thing that may fall out of this is that maybe certain aspects of the ADOS prove to be better at showing this kind of semantic difference than others. For example, it looks like in cases where the conversation is more structured, like the questions section of the ADOS, you see a lot less variability with the typically developing kids, relatively more variability with the ASD kids, and a lot more group differentiation. So perhaps there's a section of the ADOS that's more informative for this kind of semantic similarity measure, and that might be useful in the long run. But also, more than group differentiation, we're interested in using these measures of variation in language as longitudinal measures; that's what started this thing. So: how much does this measure change if we switch children, but maybe also, how much does this measure change if we measure over time? Those are the sorts of things we're currently thinking about, but yeah, we're early in that stage for sure. All right, let's thank our speaker. And thanks to all of our short-paper presenters for getting a whole lot of information into not a whole lot of time; that was amazing.