Personalized Machine Learning: Towards Human-centered Machine Intelligence
[Host] Hi everyone, thank you for coming. Today we have Oggi Rudovic, who is finishing his Marie Curie fellowship at MIT, in the Media Lab. Before that he was a graduate student at Imperial College in London. The fellowship is paid for and organized by the European Union and MIT, so he spent some time in Europe and some time at MIT while doing it. His research in grad school started with computer vision and, I assume, very quickly became machine learning in general. He has worked on Gaussian processes, deep learning, reinforcement learning, and multimodal data analysis. Today he'll talk about personalized machine learning in a very interesting case: helping doctors track the treatment of autistic children. This presents very interesting problems, with domain transfer and meta-learning on one side and reinforcement learning on the other. I'll just let him talk about all this. Hopefully he's not going to talk about everything he's ever done, because he's done a lot for somebody who is just finishing his postdoc; he'll focus on one or two things that he's done most recently.

[Speaker] Thank you for the introduction. Can you hear me well? It's a great pleasure to be here, and thank you for coming. I could skip the first slide, but I will just walk through it to get some nice pictures here. I did my bachelor's degree in automatic control theory at the University of Belgrade, then a master's in computer vision, and then a PhD in computer science at Imperial College in London, where I focused on Gaussian processes and sequential learning, mostly conditional random field models. Now I'm doing a postdoc, in the final stage of my Marie Curie fellowship, where I work on deep learning and reinforcement learning for human data, and I will show you what exactly I mean by this. These are the people I collaborated with most in the last two years.
In my talk I will first focus on my previous work, which is mostly concerned with facial behavior analysis; this is something I did during my PhD and in my first year as a postdoc at Imperial College. Then I'll talk about robot-assisted autism therapy, and I'll use this as a use case for the modeling that I'm trying to do in terms of personalized machine learning. Then I will describe the technical background of the personalized deep networks that I worked on over the last two years. Hopefully we'll have time to cover some other methods that I worked on, which span more general human data analysis and personalization techniques. Finally, I will talk about my vision of how I would approach the other challenges in personalized machine learning.

So, modeling human facial behavior: what is the main reason for doing it? We have many applications, such as human-robot interaction, therapy, pain monitoring, gaming, and in-vehicle computing, and also human-computer interaction, where we want to detect certain cognitive and emotional states. The face is one of the most powerful channels of our nonverbal behavior, and that's why we want to encode how people express their emotions and these states through facial expressions. Using computer vision and machine learning, we can do that. But before that we need to establish some standards: how do we describe faces? There are two general approaches. One is message judgment, which is more subjective and relies on our perception of emotional and cognitive states.
The old-fashioned approach is the categorical one, where we classify emotion into basic categories. There is also a dimensional approach, where we use a more fine-grained scale in terms of valence, which means how positive or negative the emotion is, and arousal, which is the level of excitement. There is also a mapping from the categorical to the dimensional approach, so these are different metrics that we would use to describe faces. There is also a more objective way, one that allows us to build a face grammar: facial action units. Those are the activations of the facial muscles, which enable us to capture all the variety that we can express with our faces. Going a step further, we can encode their intensity. As you can imagine, here we have more than 32 facial action units, and for each we have more than five intensity levels. So the problem of facial expression recognition may seem simple from the outside, because for simple emotions expressed on the face we can recognize them using basic computer vision. But if you look at this problem, it is more than just recognizing basic emotion categories. Here we have multi-task learning: we have a distribution of different intensity levels over multiple tasks that we want to model, where each task is a different action unit.

We can also always use this sign judgment approach to go back to the perception approach, where we want to map action units into emotion categories. Overall, what I've worked on in the past was this: given different problems, such as static images, multi-view images, and image sequences, I applied different machine learning and computer vision techniques in order to achieve, at the output, recognition of the emotion category, or the action units, or their intensities. Now I will just show
the snapshots of the techniques that I worked on. I won't go into more detail, because I will focus later on the technical parts of the personalized networks that I want to introduce here. One of the methods I worked on was discriminative Gaussian processes, where we use the data from multiple views to find an embedding in which we classify the emotions. Then, whatever view the face comes from, we can classify it in the common space. More recently, I worked with my PhD students at Imperial College on a combination of deep convolutional networks and nonparametric models like Gaussian processes, where we found a very powerful combination: using the convolutional networks as feature extractors for different parts of the face, while embedding those features into a low-dimensional space using Gaussian processes and autoencoders, to estimate the intensity of facial action units.

I've also worked on conditional random field models, where I extended the standard conditional random field model for sequence modeling by introducing latent variables that encode the relationships between different action unit intensities. Those relationships are encoded using ordinal variables, where we know that level three is always higher than level two. This information is very important if we want to constrain the model when predicting different intensities of facial action units.

Finally, this is a work in which we tried to combine multiple datasets, where each dataset had a limited number of annotated action units and some of them overlapped across datasets. What we wanted to have in the end is a model that can, given a new dataset, recognize the union of all the action units that existed in the training data. For that, we formulated this as
a problem of modeling multiple distributions of action units and their intensities, where we used copula functions, which are very powerful statistical descriptors that model dependencies between pairs of marginal distributions. By using these techniques, I managed to build networks and other models that could automatically recognize action units and their intensities from face images.

While I was, and still am, excited about that work, I was looking for something that would help me explore how these techniques work in the real world, moving
beyond the standardized datasets, where we build the model, test it on the dataset, achieve the state of the art, and move on to the next dataset. So when I started my fellowship, I focused on autism therapy, where we have a real challenge: a therapist working with a kid, using a robot as an assistive tool, and trying to use the tool to engage the kid in the therapy content. Just to give you an idea of what it is like, this is an example of the therapy. So we have a robot here. Maybe we can stop.

What you saw there is that the robot was used as an assistive tool to engage the kid in the therapy and also to teach the kid expressions of typical emotion categories like happiness, sadness, and so on. But what is the reason for using robots here? I would like to motivate that first, before I move into the technical part. Robots provide a safe environment for kids with autism. One of the main challenges in working with kids with autism is how to maintain and sustain their engagement levels. They easily lose engagement in the task, and that makes it very difficult for a therapist to proceed with the learning materials. So we want to have tools that can keep the kids engaged during the therapy in order to improve their learning. There is also another very important aspect of having this kind of hardware or assistive tool in the therapy. It doesn't have to be a robot; it can be any other kind of multi-sensory hardware. It can augment the therapist by monitoring the behavioral cues of the kid and summarizing those cues at the end, so that the therapist can see:
OK, this approach worked for this kid, it didn't work for that kid; this kid was more engaged with this part of the therapy than with the other. From this we can derive personalized therapy content, which is critical for learning outcomes.

[Audience] (partly inaudible) Is that the therapist?

[Speaker] Yes, at the moment, working with this kid. Traditionally, the therapy would proceed with the therapist and the kid working with face images. The therapist shows, OK, this is what a person looks like when he's happy or sad or angry, and in the next stage the kid would need to recognize, to pick out, the images when the therapist describes a certain emotional state.
But here we have a robot that plays the role of actively participating in teaching these emotional responses, while also passively collecting the data and enabling us to later analyze what behavioral patterns were expressed during the therapy.

The sessions were 25 minutes long on average; each was a one-off recording of a 25-minute session. The interaction was between three parties: the therapist, the kid, and the robot, and we had the camera placed behind the robot. All this data was recorded at two autism centers, one in Japan and the other in Serbia. I went there with my colleague from Japan. We used the NAO robot as the platform for data recording, and we synchronized all the multimodal recordings that we made, which included the robot's camera, the audio recordings, and the autonomic physiological data that we gathered using the Empatica wristband that was worn by the therapist and the kid.

[Audience] (partly inaudible) Is it speech therapy? Educational therapy?

[Speaker] It's an emotional therapy. It's a standardized therapy that would be done for kids with autism using, as I mentioned earlier, this kind of images.

[Audience] (partly inaudible) In the U.S. we have programs such as ABA. Is this program designed particularly for the robot?

[Speaker] No. This is a standardized program that would be performed by the therapist even without the presence of the robot. The robot was used here just as an additional tool to demonstrate some parts of the therapy.

[Audience] Did it increase engagement?

[Speaker] That was something we wanted to see in the beginning: whether the robot would increase or maintain the engagement of the kids. The therapist also had access to a keypad, from which he,
or she, it was a she in this case, was controlling the robot. When the kid got disengaged, the therapist could press a certain button, the robot would do something, and the kid would be re-engaged in the therapy.

[Audience] You mentioned 25-minute sessions. Is it normal to have a videotaped session and go through it afterwards to see what the engagement was like?

[Speaker] That's one of the most difficult parts that therapists are facing when analyzing the content of the therapy. They record all these therapies. A therapy can be 25 minutes on a daily basis, so on a monthly basis it would be hours of video. What the therapists do is watch these videos after each therapy and try to see, OK, these are the segments when the kid got engaged or disengaged, and then manually annotate this. What you see the therapist doing here is only 25 minutes, but the work it takes afterwards to analyze this data takes them hours. You can multiply everything by five or ten times the real time they spend in the actual therapy. That's another reason why we wanted to have tools that can automate this process and assist the therapists, or at least give them some suggestions: these are the interesting segments of the video to focus on, to do some filtering.

Overall, and I didn't say this yet, our main goal here is, using this multimodal data of
therapy sessions, we want to make personalized estimations of the affect and engagement levels of the kids participating in this therapy. By engagement I mean how engaged the kid is in the tasks with the robot. The therapists were given a list of rules on how to rate these engagement levels based on the part of the therapy. We also use valence and arousal as measures of affect: valence is how positive or negative the emotion of the kid is, and arousal is how excited the kid is during the therapy.

The overall design of a system that we use for human-robot interaction could be presented in a simplified way like this: we do the data sensing, and then we have the perception and the interaction model. We want to go through this pipeline from beginning to end: use the data to do some reasoning with machine learning or computer vision algorithms, or both together, and then inform the robot about the cognitive and emotional state of the kid. Then we have different interaction strategies that, based on those inputs, enable the robot to respond in a socially appropriate and socially intelligent way during the therapy. What I'm going to focus on here is the perception part.

[Audience] Just to clarify: in the therapy process, the role of the robot is only to observe the emotions of the kids? It doesn't react?

[Speaker] Yes. In this example that I'm showing here, the robots are only observing, and are controlled by the therapist.

[Audience] And is there a list of instructions for the therapist on how to press the buttons?

[Speaker] Yes. The therapists were instructed about the functionalities of the robot, and those functionalities were designed based on the therapists'
suggestions: what the therapists would like to have while they are doing the therapy. For example, if the therapist presses button one, the robot shows the expression of happiness. On the keypad we denoted all the different emotional categories and other activities, like waving at the kid or saying something to the kid, things that would re-engage the kid in the therapy.

The first step in processing all this data is to perform the data sensing, and for that we used open-source tools. For the autonomic physiology data we used the Empatica E4 wristband, which gave us readings of electrodermal activity, which is a measure of autonomic arousal, then body temperature, heart rate, and also the accelerometer and blood volume pulse data. We processed this data using other tools that we developed in the lab. We also used the OpenFace tool to extract facial features like head pose, facial points, facial action units, and their intensities, and we used the OpenPose tool for tracking the body joints. All this was applied to the video in near real time. For the audio descriptors we used the openSMILE tool, which gave us 24 low-level audio descriptors of the conversations between the therapist and the kid. We then segmented the audio content that belongs only to the kid, to do the speaker separation. This is a very important part of the data processing that we later used to build a machine learning system for automatic estimation of engagement.

[Audience] Did you also track the facial expressions of the therapist?

[Speaker] Yes,
we did, but we didn't include that in this study, because here we focused only on the engagement expressions of the kid. But we have this data, and also the autonomic physiology data of the therapist. That was envisioned as the next step: to analyze the synchrony between the kid and the therapist, and also the engagement of both with the robot.

This is one of the typical tasks that therapists would do after the therapy. As I mentioned earlier, they would watch these audio-visual recordings, associate with them estimations of valence, arousal, and engagement, and try to find the most interesting segments. What we did here, after we recorded all the data, was ask five human therapists who were not participating in this therapy to look at those audio-visual recordings and provide continuous estimations, on a scale from minus one to one, using a joystick, of the valence, arousal, and engagement level of the kid during the therapy. We combined these annotations by applying techniques for sequence synchronization, in order to obtain a single ground truth from all five therapists. Each of these annotations was done independently.

[Audience] (partly inaudible) So they move the joystick up and down for each metric?

[Speaker] Yes, they go through the video three times. It's a very laborious thing to do, and that's another motivation for wanting to automate this process.

OK. In traditional machine learning, we would approach this problem by taking all the data of all the kids, performing some feature extraction, applying a predictive model, and then having at the output, for example, the engagement estimation. But what is the main assumption in this approach? We want to maximize the average performance over all the kids. So this approach
doesn't allow us to focus on each individual; it just takes all the kids and treats them as a single domain. But as we know, kids with autism are very different from one another: every kid has a different and very unique way of expressing engagement and affective states. Treating them as a single domain can be misleading and suboptimal in this case, and also in the case of the general population, where every person has individual differences that set us apart, so building one approach for everyone may not be the optimal solution.

Here are a few examples of the consequences of these differences. Imagine we are trying to build a classifier that separates neutral from expressive faces using some training set of individuals, and we find that for those individuals this is the optimal decision boundary. What happens when we have a new individual? The generic classifier is not optimal anymore, because the data of this individual falls into a different region of the classification subspace. What we want is a way to shift this generic classifier towards the ideal classifier for this test subject, because in the end what we are interested in is how well we perform on this new subject.

This is another challenge that comes from the heterogeneity in the data that we are working with, for example in this case with autism. What you see here is a two-dimensional projection of the multimodal data, which is the audio-visual recordings and the autonomic physiology data of these kids. We took all their sequence data and applied t-SNE, which is an unsupervised dimensionality reduction technique that tries to find low-dimensional, in this case two-dimensional, embeddings of the high-dimensional data.
What is important here is that this clustering, this projection, is done in an unsupervised way. Each of these points belongs to one single frame of one of those kids. If you notice this clustering effect, you will see that each kid forms a little cluster in this subspace. This is evidence of the level of heterogeneity that we are dealing with here. If you think about building a classifier that works in this part of the space, it won't be able to generalize if you get data from kids that come from that other part of the space.

This is another sign of the differences that we are dealing with when working with human data. What you see here are the distributions of the annotations that we obtained for this data. For example, in this case we have arousal and valence,
where arousal is the excitement level and valence is how positive or negative the emotion is, for all the kids that we used in this study and for whom we got annotations by human coders. This is the distribution of these two dimensions of their annotations for both cultures. But when we plot these distributions at the individual level, we get completely different distribution patterns. This means that if you build a population-level model, it would focus on modeling this kind of distribution at the output. When you then want to apply this classifier, we are dealing with completely different and, as is evident here, multimodal distributions. So that classifier won't be optimal anymore when applied to this kind of data.

Although we didn't ask the kids with autism to report their engagement levels, another challenge that we usually face in working with human data is the way we report our emotional and cognitive states. For example, in the case of pain we have multiple metrics. One is self-rating, which is user-centered, so very high pain for one person may mean something completely different for another person. Then we have observer rating, where someone external is watching us, for example our facial expressions, and reporting what pain we are experiencing. These are all subjective metrics. We also have a face-based measure: from facial action units we can apply a formula to obtain the pain level. What I'm trying to communicate here is that we have multiple ways to encode the same phenomenon that we want to model, and these vary a lot between individuals.

So ideally, what we are interested in is a system that would allow us to handle these differences by focusing on maximizing the performance for each of these individuals. In
order to tackle this, I worked on the design of a deep neural network that encodes these individual differences by having a hierarchical structure. I'll start from the top level. Here we have as input the different modalities: visual, audio, and physiological. Each of these layers is one layer in a deep neural network. What you have in the top part is autoencoders that take as input the features, which are very noisy, and try to encode them into a lower-dimensional subspace. Then we perform the fusion of these features at something that we call the context layer. Why is this the context layer? Here we do the segmentation of the kids, and of the network structure, based on different metadata, like culture, gender, and the ID number. We also include scores called CARS, the Childhood Autism Rating Scale. This is the rating given on the original scale by the doctors, saying, OK, this kid is this verbal, or these are the motor abilities of that kid, and so on. This gives us a very strong prior about these kids. Finally, at the individual level, once we have built the whole structure of the network, we make the predictions for that specific kid.

I'll now introduce the technical part of this network. For that I defined the learning operators that I used to build the network. The first one is the autoencoder, where we have the encoding of the input and the decoding into the original space. It is accompanied by a companion function, which makes sure that the embedding that we get here, h0, is also a
good proxy of the output phenomena that we want to model, in this case the engagement, valence, and arousal metrics. For that we use linear activation layers.

Then there is another one, called the learn operator, which is a connection of one deep layer followed by a regression layer that estimates the output metrics: valence, arousal, and engagement. For the deep layer we use the rectified linear unit, because of the vanishing gradient problem, so that we have a more robust network. This is basically the function that we use for training these individual layers in sequence, two layers at a time. The first part encodes the mean squared error of predicting the target outcomes; the other is the encoding part, for our input features.

Another interesting operator is called nesting. The idea is that we do sequential learning of the network, so we learn one layer at a time. Then we nest another layer, which is a replica of that layer, and we relearn that layer by optimizing only its own parameters, while freezing the parameters above.

Cloning is the final operator that we use here. What it basically does is this: when we have one layer in the network and we want to split the kids based on their gender, we need to replicate that layer horizontally so that the copies can focus on the males and the females. We do that by taking the same layer that was learned jointly for these two groups and replicating it, initializing the parameters of each copy using the parameters of the original layer.

So how does the learning proceed? We start with the group-level network, which has multiple layers: the first one is the encoding layer, then the fusion layer, then the culture, gender, and individual layers.
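The two-part training objective mentioned above (a prediction term plus an encoding term) could be written roughly as follows. This is only a sketch: the weighting factor $\lambda$, the averaging over $N$ training frames, and all symbol names are assumptions made for illustration and are not given in the talk.

```latex
\mathcal{L}(\theta)
  = \frac{1}{N}\sum_{i=1}^{N} \left\lVert y_i - \hat{y}_i \right\rVert_2^2
  \;+\; \lambda \, \frac{1}{N}\sum_{i=1}^{N} \left\lVert x_i - \hat{x}_i \right\rVert_2^2
```

Here $x_i$ stands for the input features of frame $i$ and $\hat{x}_i$ for their reconstruction by the autoencoder, $y_i$ for the annotated valence, arousal, and engagement values and $\hat{y}_i$ for the output of the regression (companion) layer, and $\theta$ for the parameters of the pair of layers currently being trained, with all other layers frozen.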
The final individual layer is for prediction of the target outcomes: valence, arousal, and engagement. What we do first is learn the first layer in the network, where x is the input multimodal data, encoded as I showed in the previous example. We learn only these two layers. Once we have learned these two layers, we freeze them and perform the nesting of the next pair of layers. We learn those layers, we
freeze them and perform another nesting, and so on. That's how we go all the way to the bottom of this network, until we reach the individual level. The parameter optimization is again done using the mean squared error as the loss.

What is very important here is that this layer-wise training benefits from two things. The first is that having these companion functions makes sure that, as we go deeper in the network, we preserve the discriminative information about the outcomes that we want to model. The other important thing is that after we have trained the whole network, it is really necessary to do one final pass to synchronize all the parameters in the network by fine-tuning them jointly. This is done in the second step of the learning of the group network.

Once we have this group network, the next step is to form the personalized network without any further learning. We take the group-level network and discard all the companion functions that were used; you can think of them as regularizers used during the training of the network. Once we have this well-initialized group-level network, we do the personalization using the cloning operator. We come to the culture level and split it into two parts by replicating this layer, then we go to the gender layer, and we keep splitting all the way to the individual level. This is the kind of structure that we arrive at: it has a tree-like structure, like a binary tree.

Once we have initialized this network for all the kids in the dataset, there are two very important steps that are needed to make this work. The first step is called fine-tune one. Now that we have this new structure, we need to fine-tune it for each individual kid.
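The personalization step described above, in which the group-level layers are cloned down to each culture, gender, and child, might be sketched as follows. This is a minimal illustration and not the speaker's implementation: layers are reduced to plain lists of numbers, and the names `personalize`, `group_layers`, and the example children are hypothetical.

```python
import copy

def personalize(group_layers, children):
    """Expand shared group-level layers into a tree of cloned layers.

    group_layers: dict mapping a level name ('culture', 'gender', 'individual')
                  to that level's parameters (here just a list of floats).
    children:     list of dicts with 'id', 'culture' and 'gender'.
    Returns (nodes, paths): nodes maps a tree-node key to its own copy of the
    parameters; paths maps a child id to the node keys on its branch.
    """
    nodes = {}
    paths = {}
    for child in children:
        keys = [
            ("culture", child["culture"]),
            ("gender", child["culture"], child["gender"]),
            ("individual", child["id"]),
        ]
        for key in keys:
            if key not in nodes:
                # Cloning: a new node starts from the group-level parameters
                # of its level, but is an independent copy from then on.
                nodes[key] = copy.deepcopy(group_layers[key[0]])
        paths[child["id"]] = keys
    return nodes, paths

# Hypothetical group-level parameters and children, for illustration only.
group_layers = {
    "culture": [0.1, -0.2],
    "gender": [0.3, 0.0],
    "individual": [0.5, 0.5],
}
children = [
    {"id": "c1", "culture": "Japan", "gender": "m"},
    {"id": "c2", "culture": "Japan", "gender": "f"},
    {"id": "c3", "culture": "Serbia", "gender": "m"},
]
nodes, paths = personalize(group_layers, children)
```

Children from the same culture share one culture-level node, while every child ends in its own individual-level node, which gives the tree-like structure mentioned in the talk. No learning happens in this step; the clones only inherit the group-level initialization.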
The first thing we tried was: OK, let's take the data of one kid, pass it through the specific branch that leads from the input to that kid, fine-tune it, and then move to the next kid. Once we did that, we realized it doesn't work at all. It would overfit to one kid at a time, so it would completely diminish the influence of the other layers, and it wouldn't work for the other kids.

So instead of optimizing using all the data of that kid, we said, let's do random sampling of the paths from the input to the node of that specific kid, using only a few examples. Basically, we take a sample from one kid and do a pass of gradient descent, then take another kid and sample a sub-path for that kid. For example, for this kid here, one step could be to optimize only this layer and this layer using the data of that kid. We would repeat this randomization with stochastic gradient descent until we had explored the data of all the kids that we had for training.

[Audience question, inaudible.]

[Speaker] Yes. You can think of the route from the input to the output as a chain. On that chain we are sampling sub-chains, freezing everything else in the network, and allowing it to fine-tune only those parts.

[Audience] Is this scalable if you increase the number of individuals?

[Speaker] Well, it depends on the network structure you would like to design, and on how many factors you have. Learning one joint network shouldn't be a problem; the complexity will depend only on the depth of the network that we would like to have.
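The random sub-chain fine-tuning described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: each "layer" is a single additive bias rather than a real network layer, the loss is a squared error on one output, and the names `sample_subchain` and `finetune_step` are hypothetical. Only the sampling scheme (pick a child, pick a contiguous sub-chain on its path, use a few samples, freeze everything else) follows the talk.

```python
import random

def sample_subchain(path, rng):
    """Pick a random contiguous sub-chain (at least one node) of a branch."""
    start = rng.randrange(len(path))
    end = rng.randrange(start, len(path))
    return path[start:end + 1]

def predict(nodes, path, x):
    """Toy model: the input plus the bias of every node on the child's branch."""
    return x + sum(nodes[key] for key in path)

def finetune_step(nodes, paths, data, rng, batch_size=2, lr=0.05):
    """One randomized update: one child, one sub-chain, a few samples."""
    child = rng.choice(sorted(paths))
    active = sample_subchain(paths[child], rng)
    batch = rng.sample(data[child], min(batch_size, len(data[child])))
    for x, y in batch:
        err = predict(nodes, paths[child], x) - y
        for key in active:
            # Gradient of the squared error w.r.t. an additive bias is 2 * err.
            # Nodes outside the sampled sub-chain stay frozen.
            nodes[key] -= lr * 2 * err
    return child, active

# Hypothetical two-child tree sharing a culture-level and a gender-level node.
nodes = {"culture:A": 0.0, "gender:A:f": 0.0, "child:1": 0.0, "child:2": 0.0}
paths = {
    "1": ["culture:A", "gender:A:f", "child:1"],
    "2": ["culture:A", "gender:A:f", "child:2"],
}
data = {
    "1": [(0.0, 1.0), (1.0, 2.0)],
    "2": [(0.0, -1.0), (1.0, 0.0)],
}
rng = random.Random(0)
for _ in range(200):
    finetune_step(nodes, paths, data, rng)
```

Because each step touches only a short sub-chain and a handful of samples, no single child can dominate the shared nodes in one update, which is the motivation given in the talk for not fine-tuning a whole branch on all of one child's data.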
In the end, it depends on how many individuals you have here. But a very nice property of this network is that we can learn different parts of the network at different sites and then merge the network, with additional fine-tuning steps, to get a global network, and then decompose it again and send it to the original sites. I have that part in the future work, if we make it there, so I will show you some ideas for how we would do that using decentralized learning.
Does this grow exponentially with the children, or linearly?
Well, it depends on the depth, so it depends on what the intermediate factors are that we have here. For example, in this case we are modeling culture and age. Depending on how many factors we have, it is maybe exponential in the factors, but with individual children it is linear: it's linear in the number of kids. But if you think of this as a binary tree, then it will have a logarithmic complexity in order to reach the individual level, because we are doing sub-chains here.
In the second step, the fine-tune two part, what we have, which is again very important, is that once we have tuned the network parameters, we
need to go through every single kid and do the final fine-tuning. However, here we are safe, because the parameters of these two layers depend only on that kid. So we are allowed to fine-tune them as much as we like, because they won't affect the other kids in the network.
On the left-hand side, is this neural network for all the kids?
Yeah, so this is for all the kids. And this one is also for all the kids; it is the same network with the kid-specific layers. So these two are layers that are specific to the kid. This is the network that we have in the end. The two layers that you saw would fall under this kid. Then, depending on how many layers we have here and how many kids, that would affect the network structure.
OK, so here are some results about the learning performance of the network. Here I compare three types of models. The baseline is the multi-layer perceptron that was trained without this sequential learning; all of its layers were trained simultaneously. Then the group network, which was used to initialize the personalized network, is shown in red. And the personalized network is this one. What we are seeing here is this gap in the error reduction during training. This shows us that, due to the flexibility of the personalized network, we can fine-tune the network parameters better to the target kids. What we also see here is that, compared to the standard multi-layer perceptron, we still get a benefit from training these layers sequentially.
So if you look at the other results that we get here for the personalized network and the group-level network, these are the intraclass correlation coefficients
that we use as the performance metric. It measures the consistency, or agreement, between the model predictions and the ground truth provided by the human labelers. The scores range between 52 and 65. One thing you can notice is that there is a high variance. This is because these results are obtained by averaging over all the kids that we have in the data set: we computed the individual performance and then averaged. So this is another indicator of the high heterogeneity in this population.
However, we see clear improvements over the group-level network, which are more pronounced if you look at this metric that I call task rank. It measures on how many different tasks this model was performing better than the others, as a percentage. So in 46. 4.5 percent of cases the personalized model was outperforming all the other models that were used here. Each task is defined, as I mentioned, like this: for each kid we have three outputs, which are valence, arousal and engagement, so it would be 105 tasks in total.
If they are new, how do you know who they are in the video?
We know the ID of the kid when we are doing the predictions.
You know the ID, but when you train, these new kids are not seen?
That's another problem that we have here, a limitation that I will talk about.
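For readers unfamiliar with the metric, here is a small sketch of a single-measure intraclass correlation computed from a two-way ANOVA decomposition. The talk does not say which ICC variant was used, so both the consistency form, ICC(3,1), and the absolute-agreement form, ICC(2,1), are shown as an assumption; the function name `icc` and the toy ratings are hypothetical and are not data from the study. The talk reports the score scaled to 0 to 100, whereas this sketch returns values on the 0 to 1 scale.

```python
def icc(ratings, kind="consistency"):
    """Single-measure intraclass correlation (Shrout and Fleiss style).

    ratings: list of rows, one row per rated item (e.g. a video frame),
             one column per rater (e.g. the model and a human coder).
    kind:    "consistency" gives ICC(3,1); "agreement" gives ICC(2,1).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Mean squares for items (rows), raters (columns) and the residual.
    ms_rows = k * sum((r - grand) ** 2 for r in row_means) / (n - 1)
    ms_cols = n * sum((c - grand) ** 2 for c in col_means) / (k - 1)
    ss_err = sum((ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
                 for i in range(n) for j in range(k))
    ms_err = ss_err / ((n - 1) * (k - 1))
    denom = ms_rows + (k - 1) * ms_err
    if kind == "agreement":
        # A systematic offset between raters (bias) lowers the score.
        denom += k * (ms_cols - ms_err) / n
    return (ms_rows - ms_err) / denom

# Toy example: the "model" tracks the "human" exactly but with a constant offset.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
model = [h + 1.0 for h in human]
pairs = [[h, m] for h, m in zip(human, model)]
icc_consistency = icc(pairs, "consistency")
icc_agreement = icc(pairs, "agreement")
print(icc_consistency, icc_agreement)
```

In this toy case the consistency form is 1 because the two signals move together perfectly, while the agreement form is lower because of the constant offset between the raters.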
This network is subject-dependent, so we need to have all the kids present during the learning of the network. The way we did the learning here is that we split the kids' data into training, validation and test sets, but the data of each kid was present in all of these subsets. So it's a subject-dependent model.
What we also have here: person MLP is a child-dependent multi-layer perceptron, so it is a multi-layer perceptron trained only on the data of that kid. Linear regression, support vector regression and gradient-boosted regression trees are the baselines that we use here. As we can see, these baselines underperform significantly here. But what is interesting is the performance that we get in the leave-one-child-out experiment. We train the network and then apply it as an ensemble learning method, where we have multiple kids in the output. We assume that this kid is a new kid that comes to the therapy, and we don't know the culture or gender, for example. We just take the predictions by all the models at the bottom layer and average them, and this is the performance that you will get. This shows that these models, without knowing the metadata about a specific kid, or without having access to any behavioral data of this kid to fine-tune the network layers, would result in very low performance.
A few questions here. Could you see which model agrees with the human rater the most on that child, and infer the gender and the culture and so on, and find children that are similar to each other?
Well, that's interesting. We haven't tried it. We haven't tried that kind of analysis, to classify the kids based on the agreement with the human coders.
We were thinking about how we can do this classification automatically, that is, how to find the structure that would be optimal in terms of the variance of each kid's data at different levels of the model hierarchy. But one of the reasons why we designed the network in this way is that we wanted to also leave space for interpretability of the network, which was very important for the therapists. One method that I described in the paper where we published this work (this is the first work that was published in Science Robotics on human-robot interaction) showed that by analyzing the outputs of different layers we can see the individual differences at the culture level and the gender level. In terms of the expression of engagement, we showed that the Japanese kids, when they were highly engaged, were very still, while the Serbian kids, when they were highly engaged, were moving around a lot. We found that from the activations of the network layers: by analyzing their gradients, we noticed these individual differences, which were very interesting because they show that these cultural components differ a lot between the two groups.
On the previous question: why can't you just take a new child, add the child as a leaf in the tree, and just train the last layer?
Yes, so we can include it.
So if you want to extend this, you have a little bit of data: the doctor would label a new child for 25 minutes or whatever, and you just add it to your model as one more leaf?
Yeah.
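The "one more leaf" idea raised in this exchange can be illustrated with a toy sketch in which only a new child's final layer is fitted while the upper layers stay frozen. Everything here is an assumption for illustration: the last layer is taken to be a single linear unit, the `features` stand in for outputs of the frozen pre-trained layers, and the function names and numbers are hypothetical. The speaker notes that this extension was not explored in the work itself.

```python
def predict(x, w, b):
    """Output of a single linear unit."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mse(features, targets, w, b):
    """Mean squared error of the linear unit on a small labeled set."""
    return sum((predict(x, w, b) - y) ** 2
               for x, y in zip(features, targets)) / len(targets)

def finetune_last_layer(features, targets, w, b, lr=0.05, epochs=500):
    """Fit only the new leaf's linear layer with full-batch gradient descent.

    features: outputs of the frozen upper layers for the new child's frames.
    w, b:     initial leaf weights, e.g. cloned from the parent node.
    """
    n = len(features)
    w = list(w)  # work on a copy so the cloned initial weights stay untouched
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(features, targets):
            err = predict(x, w, b) - y
            for j, xj in enumerate(x):
                grad_w[j] += 2.0 * err * xj / n
            grad_b += 2.0 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# A handful of labeled frames for the new child (toy numbers).
features = [[0.0], [1.0], [2.0], [3.0]]
targets = [1.0, 3.0, 5.0, 7.0]
w0, b0 = [1.0], 0.0
w1, b1 = finetune_last_layer(features, targets, w0, b0)
print(mse(features, targets, w0, b0), mse(features, targets, w1, b1))
```

Since only the leaf's few parameters are updated, a small amount of labeled data can be enough, and the other children's layers are unaffected.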
So you can have a model per child so that they don't have to label any more?
That's the main potential of this model: we can easily incorporate a new kid by knowing the metadata of that kid, so if you know the culture and gender of the new kid.
After culture and gender you have individual kids. You do have to have some training data to train that last layer?
Yes, we must have the data to train the last layer. But the advantage of having the pre-trained model is that when we have a new kid that we want to include in the network, we don't need a lot of data from that new kid, because the model is already pre-trained on the upper layers. What we need to fine-tune is only the last layer of that kid. We haven't explored this in this work, but there's a plan to go in this direction: growing this network, and how we can take advantage of this pre-trained network to easily include new kids. That would be one of the main benefits here.
What is intraclass correlation? These things that you're modeling are real-valued outputs, there are no classes, so how do you get intraclass correlation?
So we have image sequences, and for each frame we get the estimates for valence, arousal and engagement, which are continuous estimates, and then we
have the corresponding labels that are given by the human labelers. You can think of this as a Pearson correlation, except that it also takes into account the bias that may exist between the two. This measure goes from 0 to 100, the higher the better. If we reach 100, there is perfect agreement between the raters.
Is it variance explained?
Yeah, so it is the variance that is explained. It takes into account the variance between the annotators. In this case the model is one annotator and the human is another one, so it takes into account their variance. We want to see what amount of variance cannot be explained by this.
One interesting observation here is that when we computed the agreement between human raters on this data, we reached levels between 50 and 55 percent for all these measures, while with the model that we designed we reach 59 percent. It doesn't mean that this model is better than the humans. The point here, which is very important, is that the model is more consistent with the ground truth labels than the humans are with each other. If you have two humans annotating, their consistency levels would range from 50 to 55 percent. What we achieved with this model was 59 percent, showing that we can have a highly consistent estimate.
There is a large range; it depends on the kid?
I'll show that here. This is the performance that we get on the individual level, which is very important when analyzing the personalized models. Here we have the difference in the intraclass correlation between the personalized model and the population-level model, and these are the improvements for the kids. What we see here is that for the majority of the kids we have large
improvements in terms of the intraclass correlation, and this is most pronounced on the engagement metric. However, there is also a side effect here: we have negative transfer for some kids. We underperform after the personalization. When we looked into the data of these kids, what we found is that for these kids we either didn't have enough variety in the data in terms of their engagement levels, so they were either disengaged all the time or highly engaged, or we had only a little data. By a little data I mean that in the therapies that we recorded there are many cases where we had a problem with missing data, so some of the modalities would be completely absent. The kid would move so much that we couldn't have the face present in most of the frames. Or the kids would take off the watch during the recording, so all the physiological data would be gone. So we would lose a lot of data, and that's one of the motivations why I used autoencoders at the input, to try to somehow compensate for this missing data. But in the extreme cases we couldn't handle that well. This is something that is very important, and I will try to speed up.
What we see here is the empirical cumulative distribution function, which gives us another view of the error that we are measuring. Our error in terms of the intraclass correlation can go from zero to one. What you're seeing on the x-axis is one minus the intraclass correlation for each kid, and here is the empirical CDF that we are getting, which is the distribution of these errors. The main point is that for the personalized model we have this gray line here, which shows us the distribution of the errors in the personalized model compared to the population-level model. If you look at this part here, we see that when the errors are very small, meaning
when the model is doing well, there is not much benefit in personalization. If you have a model that is already doing very well on that kid, or on this group of kids, personalization won't improve it much. And when we have a model that is completely failing on these kids, which could be because we have wrong assumptions about the model that we apply to these kids, then again it is very challenging to personalize that model and get some benefit from personalization. Where there is the most space for improvement is in this intermediate range, when the model is uncertain about that kid but with a little bit of a boost it can perform much better.
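The empirical CDF of per-child errors discussed above can be computed in a few lines. This is only a sketch: the helper name `ecdf` is hypothetical, and the per-child ICC values are made-up numbers on a 0 to 1 scale chosen to mimic the described pattern (little change at the extremes, larger gains in the middle), not results from the study.

```python
def ecdf(values):
    """Empirical CDF: sorted values and the fraction of samples <= each one."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

# Hypothetical per-child ICC scores; illustrative only.
icc_group = [0.20, 0.45, 0.55, 0.70, 0.90]
icc_personalized = [0.22, 0.60, 0.68, 0.78, 0.91]

# The error plotted on the x-axis is 1 - ICC for each child.
err_group, cdf_group = ecdf([1.0 - v for v in icc_group])
err_pers, cdf_pers = ecdf([1.0 - v for v in icc_personalized])

for e_g, e_p, p in zip(err_group, err_pers, cdf_group):
    print(f"cdf={p:.1f}  group error={e_g:.2f}  personalized error={e_p:.2f}")
```

Plotting both curves on the same axes makes it visible where the personalized curve sits to the left of the group curve, that is, in which error range personalization helps.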
By using different modalities, we get additional improvements. Here we personalized using only the face, body, audio or physiological data when training the model, and this is the model that uses all these different modalities. One thing that we can observe here is that there is a very high variance again, because these statistics are computed on the individual level and then averaged across the kids. However, in terms of the mean performance, the fusion of different modalities increases the overall performance of the personalized network.
OK, so what is the utility of this?
Yes, I was curious about the synchronization of the different modalities. What is the temporal resolution?
By fusing all the modalities, we take the benefit of each of them, which can be seen from the average results. If you look at the individual modalities, body was the most informative for this task across all three metrics, followed by face and then autonomic physiology data. Audio didn't perform well here.
But what was the temporal resolution? For example, audio and physiological data might not be at the same temporal resolution, right?
Yes. In this work everything was synchronized at the frame level. You get 25 frames per second, so all the features were extracted in a way that they would be assigned to the time stamps of the frames. In terms of the audio signal, we used a window that would capture enough contextual information around that frame in order to be informative enough.
Is that the right level to do this, for your measures, based on every frame?
Yes, I mean, there is another work that we did, and I'll try to show a few slides on that, where we discretized these engagement levels and then looked into five-second intervals, which was more contextually meaningful for when
we are trying to recognize discrete levels of engagement, for example. But here the annotations that we were given were on the frame level, because the annotators were watching the video recording, so each frame would be assigned a certain level of valence, arousal and engagement. But I agree, the temporal resolution can vary, and that's another dimension to explore here: what would be the optimal window to capture a more meaningful signal.
In terms of the utility, what we developed here is a real-time monitoring tool that the therapist can use to see in real time, from the camera and all the sensory inputs, what the level of valence, arousal and engagement is on a continuous scale. The blue one is the one estimated by the model and the red one is the one given by the human therapist. What is very interesting in this example is that the trend of the estimated signal follows very well the slope of the ground truth provided by the human therapist, which is what usually happens.
OK, the child moves the face, but even taking into account the whole context it is still very challenging to detect these changes in the signal, and the model successfully identified that part. These are also additional features that are provided by the wristband. By looking at all these metrics, a new therapist, for example, who doesn't have much experience can see in real time what is happening there. But where this is even more useful is when we want to have intelligent interaction with the robot. We can use these parameters to modulate the interaction, to inform the robot about the affective states and engagement of the kid, so that the interaction can be designed in a way that could facilitate engagement.
There is also the therapy summarization. When the therapy is finished, instead of having a therapist who is going to sit there for hours and go through the audio-visual recordings, we can have an automatic summary of these metrics, given for each phase of the therapy. The therapist can look into this and say: OK, the kid was very engaged in phase one but not in phase two, so I should modify phase two of the therapy. The bars that you see here are the average estimates for one therapy session and one kid, obtained by the model and by the human coders. There is a good match in terms of engagement and arousal in this specific example. In valence we didn't achieve a good match, but there is potential to improve this.
So, just to summarize. The reason why I decided to talk about this, and why I'm very excited about this work, is that this is the first work that we did in real-world conditions, working with the therapists and showing that we can build a system that really works. It's not giving us the highest performance
in terms of valence, arousal and engagement estimation, but it gives us estimates that are very consistent with the human labelers, and this is something that we want to achieve. Of course we can improve this, but what is important is that we show here that it is feasible to do this kind of machine learning estimation in real-world environments, and in a very challenging environment: when working with autistic children, this heterogeneity is very pronounced. This tool also has real uses, like monitoring of the therapy but also the summarization of it.
Of course, some questions were raised during the talk, so there are limitations. The model is user-dependent at the moment. It is also static: it doesn't take the temporal information into account, but by changing the network layers from
static layers to LSTMs we can deal with that. One of the biggest challenges is the negative transfer: how we prevent the personalization process from downgrading the performance of the population network when trying to generalize to the new kid.
I'll stop at this part here, and if we have a few minutes I would like to take a step back and give you the big picture of where I see this work going in the future. This is an example of an AI system, a general system, where we have these steps: we start with information sensing, we do the machine learning, and then we perform some intervention or interaction, using the machine learning outputs to provide personalized therapy or to have a more socially intelligent interaction. Then we are in this loop, and the whole idea is to have a system that is constantly updating over time. If you're not constrained only to autism data, this kind of approach and algorithm can be applied to any kind of data. Of course there would need to be customization and adaptation of the layers, the network structures and so on, but what I want to communicate here is that this is applicable to many different types of data. Usually we have the metadata, the prior knowledge about the users. Then we can use any kind of hardware that allows us to sense the behavioral data. At the top we have the self-reports, expert feedback and user feedback. So using machine learning to get from this passively observed data to the metrics that we want can be achieved with this kind of model, by designing the structure and allowing them to personalize the
interpretations for each individual. And of course this is not constrained only to behavioral data; it can be applicable
to many other kinds of data. For example, I applied similar methods to Alzheimer's forecasting. In the case of Alzheimer's, we have the clinical data of people who come to the study, and the idea is to try to forecast whether a person is going to convert to Alzheimer's or not over time. The main goals that you want to achieve here are to empower the user and to protect the user, for example in the case of people who have depression or people who are at risk of a heart attack. These are the cases where we really don't want to leave the models to play with the average performance. We want to make sure that we are doing very well for each and every individual. In the case of the human-robot interaction, if the robot fails to estimate the engagement and provides some average estimate, it is fine, the interaction will continue. But in these cases where people have suicidal thoughts or are suffering from heart disease, we need models that are highly accurate and personalized.
Finally, some future directions. One of the limitations of this model is that the learning is stopped after the first round of training, so the question is how we can actively learn in order to achieve this continuous personalization. Then, how we can scale up the data while assuring data privacy. And finally, how we can efficiently deploy these models and enable them to explore the context in which they are applied, because engagement in one context can have a completely different meaning from another context.
So, are there any questions at this point? Because I would need only a few minutes to show you one very interesting piece of work, if you have time.
So I guess I was just wondering, if you think about going beyond sort
of aggregate performance over the population, then it's not just about whether your model can have parameters for every individual or not, but it's also about what kind of metric you're optimizing, right? As far as I understand, you're still optimizing an average loss across all the users. Or is the loss function also changing in some way to reflect that you really want to get the problem right?
All the metrics that I showed there were averaged across the users, and the variance was across the users.
I mean the metric that's actually being optimized. I'm assuming you're still taking this network and optimizing something like cross-entropy loss or mean squared error in prediction across all the different individuals in your data set, right?
Yeah. So at the moment, yes, these metrics are still the metrics that are used for the population-level models; the only difference is how they are summarized. But I agree: in an ideal case we would assign different costs, for example, for mislabeling engagement for this individual than for another. So there should be a cost sensitivity that is also user-specific.
Just a clarification question: when you're saying user, do you mean the child or the therapist? Because I thought in your problem setting...
In this case I'm referring to the kids. The kids in this case can be thought of as users of the robot. In that sense, the robot is trying to engage with the kid, and the kid is the user.
So this is the work that was originally done with Cynthia Breazeal's group, the Personal Robots Group in the Media Lab. We said: OK, we have these models, but now we are working towards incorporating this perception module into a real robot, so that we can estimate engagement on
the fly and modulate the interaction with the robot, which is the subject we are working on now. To give you some context about the data: there is a learning activity, so the kid is interacting with the robot, and the robot is here as a tutor and learning companion. We have 43 kids that we recorded for 8 sessions over 3 months. For this we have discrete coding of the engagement levels on five-second intervals, and these engagement levels are defined as low, medium and high. These are typical kids, and the main idea is that we enable this robot to recognize their engagement level so that it can adapt the whole interaction.
Not at the moment, because we are now putting these modules into the robot. We are using this data to learn from the previous interactions how to estimate the engagement, so that in the new interactions the robot can start off with those models and adapt as soon as we have a few sessions going on.
This is the model that we propose here. We use the framework of reinforcement learning to do the active learning from the data during the interactions. We have these five-second intervals as input to the network, which we process using convolutional neural networks. Then comes the most important part, where we do the active learning. We use LSTM layers and fully connected layers to model each of the frames within this window, and then we fuse their outputs to get the action at the output of the model. The action can take different values. One is: I'm uncertain about this video segment and I need a label, so I'm not going to make an engagement prediction. Or, if I'm certain about the content, then I'm not going to ask for the label, but I'm going to output the engagement level that I think the kid is having
in this specific video. If the label is requested, it is put in the pool, and then, after the session with the kid is finished, the teacher looks at these videos and assigns the labels, and before the next session we update the models, so that we have a more effective model in the next interaction. The idea is that over several interactions we can converge with these models and have a fully personalized model.
Your learning of the personalized policy, how fast is that process?
Most of the time of the learning process is taken by the learning of the group policy, where we need to have the data of the training kids.
Do you assume that the model is initialized using a group policy?
Yes, it is initialized using the group policy, which is learned from the training kids. We split these 43 kids into 22 kids for training and 21 for testing.
The group policy is learned using supervised learning?
Yes, it is learned using supervised learning during the training stage. But then when you have a new kid coming
in the test stage, we apply the learned policy, and then over time, as we get more data from that new kid, we fine-tune the policy to that new kid.
Unfortunately I will not go through these equations. Just to mention, the loss that we use is a standard reinforcement learning loss, where we minimize the loss between the expected actions and the actions that would maximize the future rewards. Those rewards are defined as follows: if the model predicts well, it is plus 1; if the model gives a wrong prediction of the engagement level, it is minus 1; and if the model is requesting a label, there is a penalty for that. So this reward is modulating the learning process in the model.
But the most interesting part here is the adaptation of the model. We start with this group policy, and then from the whole interaction session we identify the segments that the model is uncertain about and put them in the pool. The human annotator looks at this, and what we found is that there are only about 6 percent of the clips that the model requests labels for. This means that in the case of 15-minute sessions, only around 1 minute of the videos would need to be manually annotated, which would take about five minutes for a human coder. So it's a very efficient way to easily adapt the model to the specific kid. Then we proceed: we update the policy, we apply it to the new session, then we get the new videos, and so on over time.
Yes, so the robot is designed, but at the moment it is not using this estimate to interact with the kid. This is a study that we are organizing now, where we are putting in this module and then testing the robot's responses in real time.
So the robot can only predict their engagement level, but there's no maximization of the engagement level? There's no reward that the robot is geared toward?
Yes. If the robot predicts engagement correctly, then it gets the reward. What you describe is another type of policy, the policy that is concerned with the interaction, that is, whether the interaction that is designed as part of this tutoring system is engaging for the kid or not. That is a completely different part. The focus here is on the machine learning part, where we are trying to just get a sense of how engaged the kid is.
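The reward scheme and the label pool described in the talk can be sketched as follows. The plus 1 and minus 1 rewards and the existence of a penalty for requesting a label come from the talk; the penalty value of -0.05, the action name `ask`, and the helper names are assumptions made for this illustration, and the toy segment sequence is not data from the study.

```python
ENGAGEMENT_LEVELS = ("low", "medium", "high")
ASK_LABEL = "ask"  # hypothetical name for the "request a label" action

def reward(action, true_level, request_penalty=-0.05):
    """+1 for a correct prediction, -1 for a wrong one, a penalty for asking.

    The size of request_penalty is an assumption; the talk only says that
    requesting a label is penalized.
    """
    if action == ASK_LABEL:
        return request_penalty
    return 1.0 if action == true_level else -1.0

def run_session(actions, true_levels):
    """Accumulate the reward over a session and collect segments to annotate."""
    total = 0.0
    pool = []  # indices of five-second segments sent to the human annotator
    for idx, (action, truth) in enumerate(zip(actions, true_levels)):
        total += reward(action, truth)
        if action == ASK_LABEL:
            pool.append(idx)
    return total, pool

# Toy session of four segments.
actions = ["low", ASK_LABEL, "high", "medium"]
truth = ["low", "medium", "medium", "medium"]
total_reward, label_pool = run_session(actions, truth)
print(total_reward, label_pool)
```

After the session, the segments in `label_pool` would be labeled by a person and used to update the policy before the next session, which is the adaptation loop described above.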