Now I will turn it over to Bill Burnett, executive director of the Design Lab, to introduce today's speaker.

Thanks, Annette, that was fantastic. I'm really excited to introduce one of my colleagues in the Design Group. Monroe Kennedy is an assistant professor of mechanical engineering in the School of Engineering, with a courtesy appointment in computer science. He's the recipient of a National Science Foundation CAREER award, given to a very small number of people for excellence in engineering and science research. He runs the Assistive Robotics and Manipulation Lab — we call it the ARM Lab — which is dedicated to developing technology that improves our everyday lives by creating intelligent robotic systems that can both perceive their environments and figure out what humans are doing in those environments, so that humans and robots can work and live side by side. Monroe's lab will be one of the labs we tour if you come to the Emerging Technology Workshop coming up in March. We're super excited to introduce him. He's got a fantastic presentation here on whether machines can understand human intention — can machines understand anything at all, I guess, is an interesting question — and in the modern world of AI, robotics, and autonomy, I think this is really one of the most interesting topics, so we're thrilled to have the chance to bring it to you. Monroe, why don't you take it away.

All right, thank you both for that introduction. It's a pleasure to speak with you today. As mentioned, I direct the Assistive Robotics and Manipulation Laboratory, and one of our guiding questions is: what comes after robotic autonomy? Robotic autonomy is the ability of a robot to be in an environment, to understand itself and its task, and to complete that task effectively. What comes next is thinking about how that robot might need to team with humans or other robots — and doing that effectively requires modeling those human teammates.

I have a figure here on the left that outlines how collaborative robotics fits into the larger conversation of robotic autonomy. When you think about the principles of robotic autonomy, a robot needs to be able to do three things: see, think, and act. Formally, that's perception — how do you observe the world through a camera, through the sense of touch, or through other sensors, and reduce it to a state, a collection of variables whose changes tell you how the world is evolving; planning — given what you observed in your environment, how do you decide what to do next in order to achieve some objective; and control — how do you actuate yourself, how do you move, how do you provide an input that changes the state of the world? So that's see, think, act: perception, planning, and control.

In the larger context of robots as teammates, perception may include observing humans in your environment — estimating their intent. Language models are very popular nowadays, and they're an adequate way to obtain intent from those around you: perhaps I can speak to my robot and say, "Hey, can you pick up this cup of water on the table," and maybe that's a valid form of input.
But in my lab, I think another interesting question arises when things have to move very quickly: it's not always efficient to use sentences to describe what you intend. And I would postulate that when humans work together, we don't always use language either — maybe we use it a lot, but we don't always use it to convey intent. One of my favorite examples is sports. If you have people working together in sports — take soccer or basketball — they spend a lot of time practicing together, so they understand how they want to achieve their objective, like scoring a point, and they're very familiar with each other's abilities and the signals each is able to convey. Given the abilities of your teammate, you can exploit that to, for instance, score a goal. The ability to understand your teammate — to predict them, to read them — is a crucial tactical advantage in sports. This notion of how you, as a robot, estimate what your human teammate needs is what we're trying to capture, and we need to express it mathematically. That's what we're getting into today: how do we express that mathematically?

I've highlighted on the left-hand side three main points in red, which we'll touch on in more detail: intent estimation — how do I observe a human and pull out a state that allows me to estimate the variables that are changing; prediction — how do I use that state, plus knowledge of the task we're trying to do together, to predict how my actions may influence the human and how their actions may influence me; and action selection — given that prediction, how do I choose actions that lead to a better outcome for the overall team?

To start this conversation off, we're going to build some tools. We said we need to mathematically model the ability to estimate intent. There are a few ways to think about this, but the most popular right now is through something called a generative model. This is not necessarily a new concept, but some of the tools used to compute it nowadays are very hot and very capable. I want to begin with a tool most of you are probably familiar with: ChatGPT. I'm sure we've all had a chance to try it out — you can go to ChatGPT on OpenAI's site, put in a prompt, and get a very good, personalized response that most of the time is pretty much on point with what you inputted as a query. Here's an example. I put in: "How could humanoid robots help society? What are the biggest challenges for robots doing everyday tasks?" Some of these answers I completely agree with — highlighted here on the right: you can provide assistance for those with disabilities; there's healthcare and manufacturing; manipulation and perception are the challenges. It did a very good job of pulling out that answer — very realistic, on point.

Beyond that, you might see applications for actual technical assistance. Let's say you have a wearable sensor where you want to use an iPhone to do pose tracking with a camera as you walk through a particular space. You can ask how you might take camera data from Swift — the coding platform for Apple — and stream it to a computer that you might connect with other robotic
components through Linux and ROS. You can see that this gives a really good output that, again, is typically about 80% correct — so if you know the general coding structure, this can really get you moving in the right direction quickly as you begin implementing these strategies. And of course, with GPT-4 you have the ability to take a picture of some website — or even a rough sketch of one — and correlate it to actual HTML code; you can use images to generate text as well.

So what's going on under the hood here, and what are some domains for this type of generative modeling? Some of the examples I've just given include text generation — you put text into the model and text comes out; diffusion-based image generation — you can put text in and an image comes out; and even music — you can put in music signals and music comes out of these tools.

Before we get into them further, I'd like to lay a very quick, high-level mathematical foundation, and I find this example very digestible. Invoke your imagination: you're a fisherman trying to determine where to sail your boat to find fish, and you've noticed from past experience that it might be helpful to see where the birds are flying and steer your ship toward that location. Here on the left we see our birds — I'll call their presence an event X — and the fish under the water we'll call the variable Y. Our goal is to determine the relationship between the fish under the water and the birds in the air.

Please excuse the fact that this is all on one slide, but if you go to the very top line, we have P(Y|X) — P of Y, then a vertical bar, then X — which tells us the probability that we will see fish given the presence of birds. We call this the posterior probability. We can find it if we know a few things. If past observations tell us how often there were birds when there were fish, that gives us P(X|Y): the likelihood of birds given fish. Then we need some fundamental knowledge of the species — what's the probability the birds are in this geographic location at this time of year? Have they migrated? There could be fish and no birds simply because the birds have migrated; and likewise, if the fish's migratory patterns are known, we can use that knowledge as well. This leads us to a very fundamental equation in machine learning: Bayes' theorem. Effectively, we use this prior information about how birds and fish relate, together with fundamental knowledge of the species, to calculate the posterior probability — the likelihood of seeing fish given that there were birds. I won't get too mathematical, but I wanted to give this foundation so you have a handle on it when we talk about these tools. Bayes' theorem is extremely powerful, as is statistics generally, because it is the art of effective guessing.
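In symbols — just restating the fishing example — Bayes' theorem reads:

$$
P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}
$$

where $P(X \mid Y)$ is the likelihood (how often birds appeared when fish were present), $P(Y)$ is the prior on fish (their migratory patterns), and $P(X)$ is the overall probability of birds being there at all.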
Another quick intuition I want to give you is this idea of whether we expect a single outcome or potentially a distribution of outcomes. On the left-hand side, say we had a bucket sitting outside, and the only way it could fill up is if there was rain. If there is water in the bucket, we can assert that at some point there was a cloud that filled it. On the other side — the right-hand side of the image — if we see a dark cloud that potentially carries rain, and I ask you how likely it is that the bucket is filled, well, you have a 50-50 chance: has it begun to rain or not? Given this cloud, past statistics might say it led to rain about half the time. So you have a distribution of outcomes versus a single outcome you can infer, and machine learning principles may rely on one paradigm or the other depending on what you're trying to do. Perhaps you're trying to learn a relationship between variables — given this variable X, what is the likelihood that this other thing happened? That's a regression model. But if I want to recognize that multiple things could result — a distribution — that's the more stochastic representation you see on the right. Both are useful, depending on your task.

To build some idea of the tools actually used to do this in the wild, I'll highlight four types of models: probabilistic graphical models, autoencoder models, generative adversarial networks, and flow-based models. The takeaway I want you to have is that there's a variable X, usually shown here in green or blue, and a variable Z. X is some observation of the world — it could be a camera image or other data — and you want your model to understand the key features of what that information encodes. You take an encoder model and compress X into what we call the latent space, or the code — that's Z — an efficient way for the computer to represent some underlying relationship. Then you use a decoder to go back to a representation of that state, X′, where you've pulled out the key features of what was important. It's salient feature extraction: given raw data, I compress it to what's key and essential, and then I can decompress it back to what was useful. That's the big takeaway. These model families are different ways to do it, and we'll discuss some of them in more detail, but the goal is really to give you some knowledge of the jargon used in this space.

So what's under the hood? ChatGPT, which we all know and love, uses something called a Transformer, and thinking back to that last slide, there are three main components inside it: a variational encoder; recurrence, which allows you to perform sequence prediction; and attention, which tells you the most important things in the sequence of inputs you should be paying attention to.
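To make the encoder/decoder idea concrete, here is a minimal sketch in PyTorch — the layer sizes and dimensions are illustrative assumptions of mine, not anything from the talk:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Toy autoencoder: compress an observation x to a latent code z,
    then decode back to a reconstruction x'. All sizes are illustrative."""
    def __init__(self, x_dim=784, z_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))

    def forward(self, x):
        z = self.encoder(x)            # the compressed "code" Z
        return self.decoder(z), z      # reconstruction X' and the latent

model = AutoEncoder()
x = torch.rand(32, 784)                        # a batch of flattened images
x_recon, z = model(x)
loss = nn.functional.mse_loss(x_recon, x)      # reconstruction objective
```

Training to minimize that reconstruction loss is what forces Z to capture the salient features of X.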
First, I want to bring your attention to this middle image: the sentence "I like science," where the goal is to translate it into German. You have an encoder that looks over the sentence and compresses it into a code, and then, with a language tag, outputs the translation in German. That's the variational autoencoder structure coming into play: what's the most important piece here, how do I compress it, and then output it in this other language, where you'd have a decoder for German or whatever other language you want. Recurrence asks how past words influence future ones: "the boy rode his bike to the park" is highly likely because of the words that came before — "the boy rode his bike to" makes a destination a highly likely outcome. That's recurrence. Attention says maybe not every spoken word matters: in "the boy rode his bike," perhaps just "boy," "rode," and "bike" were the most important things you needed to infer what came next. Attention asks: what are the most important things in the sequence I should pay attention to in order to infer a particular outcome?

Another idea is diffusion. It establishes a relationship between a very crisp state — shown here in the middle right as x₀ — and noise, and the name of the game is to figure out how to make this noise — points that are random in position — converge back to a salient state that actually makes sense. Here you see the noise condensing back into an S. Think of all these points like sand: if you had a sand shaker, how would the grains move back to the relative positions that reconstruct the S? The model learns how every point should move along the plane to get back to its position in the S formation, and whatever you condition on determines the shape of the "ground" that makes each sand particle flow back into place. So in this text-to-image example, I say "produce a group of robots playing the violin," and given that text input, the model determines how to take a noisy image and condense it back into a salient, realistic-looking image that captures what you conditioned on — the text. Again, there's a latent structure at play that describes the gradient along which those noise particles should move to give you back a coherent image.

Another example is music generation — the same idea. I can put in a signal, the beginning notes of a song, and ask: given all the music this model has been exposed to, what should come next? Features of the song are pulled out to help the model extract and reconstruct similar output music.
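Circling back to the attention mechanism described a moment ago, here is a minimal numpy sketch of scaled dot-product attention — the shapes and random embeddings are my own toy example:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each query produces a weighted mix of
    the values, where the weights score how relevant each key is -- i.e.,
    which earlier tokens matter most for inferring what comes next."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # query-key relevance
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)             # softmax over keys
    return w @ V

# e.g., 5 tokens ("the boy rode his bike") with 16-dim embeddings
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 16))
out = attention(Q, K, V)   # each row now blends the most relevant tokens
```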
So, a very important question: we see these tools and we wonder how powerful they are. Is this the dawn of true machine intelligence? I think these tools are very powerful and very useful, but there's something important to keep in mind — something that may still differentiate, in some ways, what humans are able to do from what machines can — and that's this idea of original thought.

Let me postulate an example, and then make it more high-level and intuitive. Again, please excuse the fact that everything is on the slide at once. Imagine I have a sine wave, where X is the input and Y is the height of the wave, and the blue region is all you originally expose your model to. Over this span of 20 points, the sine wave doesn't drastically increase in magnitude — it grows fairly slowly — so it would be a perfectly fair assumption, within the blue region, that all you need to predict Y is sin(X). But clearly, with our external view, we know that's not a complete picture of how this function behaves, and in fact it would be better to extrapolate — to have something that predicts points far outside the region you trained in. This is the challenge: if I expose my chat model to a bunch of verbal examples — the internet as it exists today — it can interpolate what currently exists. But if I say "create something brand new," it won't be able to do that. The easy example in the bottom right: if I ask ChatGPT for a compact, portable fusion reactor design, it throws up its hands and says "I don't know." Does that mean this is unknowable knowledge? I don't believe so — look at Lawrence Livermore and the advances they've made; I'm confident that in the next few decades we'll see a solution here. So if all the knowledge to do this exists, is being built, or could be built up by us, why can't ChatGPT do it yet? Because it's interpolating from what currently exists; it's not extrapolating beyond the training set, beyond what it was exposed to.

How would we achieve that? What would it take, at a high level? To achieve this, you have to be somewhat self-aware and ask: even in the region I'm exposed to, how do I make sure I'm doing a really good job of explaining the behavior everywhere? And for parts that are really far away — if I could gain access to those components, could I make my model perform correctly and predict things that are very far away? I have a passion for physics, so I love this particular example of how physicists explain features very far away. A few years ago we got our first picture of a black hole. It took the entire planet coming together to effectively create a camera that literally spanned the planet to capture it, and the equations that described it were the ones that describe things at smaller and smaller scales. We go from Newton to Einstein to ever more compact equations; as we understand our universe in a more fundamental way, those equations extend and allow us to make predictions even further out — like a black hole. I use that as an analogy here: if I look at this sine function and want a description that is accurate everywhere, I need to model the most compact form of the equation and have it be accurate far outside the training domain as well as within it.
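As a toy illustration of interpolation versus extrapolation — a minimal numpy sketch, with the polynomial degree and ranges chosen arbitrarily by me — a fit that matches sin(x) well inside the training window diverges wildly outside it:

```python
import numpy as np
from numpy.polynomial import Polynomial

x_train = np.linspace(0, 20, 400)              # the "blue region" the model sees
model = Polynomial.fit(x_train, np.sin(x_train), deg=15)

print(model(10.0), np.sin(10.0))   # inside the training window: close agreement
print(model(40.0), np.sin(40.0))   # far outside: the fit blows up
```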
I believe the path forward is models that understand contextualization and are able to modularize their components, play them out, and understand how they would affect each other — how they would behave even far outside the domain they were originally exposed to. That, I would argue, is how we conceptualize things: we think about concepts, we form hypotheses about how those concepts should play together, and then we test them. That's the scientific process.

Changing gears a little: what should we consider when constructing a model? Here I've drawn a simple pendulum on a string, where the pendulum has some mass, swings back and forth, and makes some angle theta with the ceiling. You can think of the state — the most efficient representation of the changing variables of this pendulum — as the angle and the angular velocity. If I have a camera looking head-on at the pendulum as it swings, I could try to extract the state from the image. The key relationships I need to be aware of include the observation model: if X is my image, that's what I'm observing, and I want to understand how my image depends on the current state — theta and theta-dot — and on any actuation, if there's a motor forcing some motion. Then the dynamics: how do things change — how do I expect the position and velocity of the pendulum to evolve, conditioned on the past position and any action taken? And if there is a decision-maker changing the state of my system, what drives their actions? In the machine learning community this is known as a policy; in the analytical controls world it's a controller description. I represent these in a probabilistic framework because it's a catch-all, whether you're dealing with analytical equations or other kinds of representations. This often falls under what's called a Markov decision process, or a partially observable Markov decision process, which describes how a state changes, everything that may contribute to that change, and how you might observe it.

The other key point: when you begin to think about distributions of outcomes, you literally have a statistical representation, like a Gaussian. If the middle plot shows some p(θ), representing my model's distribution, and I have a training-data distribution q that I exposed the model to, the goal is to match those distributions. Let me ground that for you. Imagine a person driving around a track, and I want to run a physical Turing test: if you're in the stands, I want you to tell me whether a person or a machine is driving that car. How would you determine that? Perhaps if you watch the person, they're super efficient — they do some drifting — but you notice their motion is not exactly perfect; they're fast, but the motion they perform is not exactly perfect. Whereas if I showed you a machine driving the track, it might look like it was flying by wire — perfect execution every time. And that's how you would determine that an autonomous vehicle was driving on that side.
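To ground the pendulum example in code — a minimal sketch in which the noise levels, gains, and the stand-in "camera" observation are all my own illustrative assumptions — the three pieces named above (observation, dynamics, policy) look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
g, length, dt = 9.81, 1.0, 0.01       # gravity, string length, timestep

def dynamics(state, u):
    """State transition: state = [theta, theta_dot], u = torque-like input."""
    th, thd = state
    thdd = -(g / length) * np.sin(th) + u
    nxt = np.array([th + thd * dt, thd + thdd * dt])
    return nxt + rng.normal(0.0, 1e-4, 2)          # small process noise

def observe(state):
    """Stand-in for a vision pipeline: a noisy estimate of the true state."""
    return state + rng.normal(0.0, 0.05, 2)

def policy(obs):
    """Toy controller (the 'decision-maker'): damp the angular velocity."""
    return -0.5 * obs[1]

state = np.array([0.8, 0.0])          # initial angle and angular velocity
for _ in range(500):
    state = dynamics(state, policy(observe(state)))
```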
So if I wanted to fool you, my task would be to add enough noise to the car's behavior that it mimicked a human's — so you couldn't differentiate between the two. That's an example of how the distribution itself can be useful for mimicking aspects of behavior.

With that, I'd like to move more specifically into this idea of teammate prediction, and tie it into work currently being done in my research lab. Some questions we might ask about robots and humans working together: how do they conceptualize each other? In the cartoon in the center, I've drawn a robot thinking about the team, and it even has a little thought cloud for the human. This is formally known in my field as theory of mind, of the first and second order. First order says I conceptualize myself and my teammate; second order says I'm also trying to think about what my teammate is thinking, and I use that to make decisions and plans. Here are four big questions you might ask in this space: Do the agents share the same goals and objectives? Who, if anyone, is in charge of the team? If the robot is predicting multiple partners and their future actions, how can this be done efficiently? And where does task domain knowledge come from — where did you learn to do the thing we're asking you to do?

Again, our trusted tools come back into play: the conditional variational autoencoder, generative adversarial networks, flow-based models — all of these, among others, can be useful, and they're very popular right now for inferring and predicting human behavior, given exposure to past behavior as training data. I watch a human do something for some period of time, collect those observations, and then statistically match that distribution so that my robot can predict, with some likelihood, what might happen next — realizing that teammates are stochastic, so we don't expect to get it perfectly right every time, but we want to be very close, within the distribution of what the teammate might do.

This slide provides a nice description of how this might work as a formal framework. You have a robot-human team that shares a mutual goal and objective. Both are able to take actions on the world, or the task; this causes the task to change, which we can measure as some state evolution. The agents — human and robot — observe that change, and it helps them infer what future actions to take. We're postulating that if the robot can conceptualize all of these components — the objective, the policies of both agents, how the task will evolve, and how both agents perceive the world — then the robot has a chance of rolling these things out into the future in its mind and asking how to maximize some measure of benefit, or reward. Thinking stochastically, I can look at many different sequences of action pairs and determine the action I should take next that maximizes the expected outcome, allowing me to be a better teammate to the human.
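As a sketch of that "mental rollout" idea — every function name here (predict_human, step, reward) is a hypothetical stand-in for a learned or hand-built model, not anything named in the talk:

```python
import numpy as np

def choose_action(state, robot_actions, predict_human, step, reward,
                  horizon=5, samples=20):
    """Mental rollout: for each candidate robot action (held fixed over the
    horizon), sample stochastic human responses, average the reward over
    many rollouts, and pick the best-scoring candidate."""
    def score(a_r):
        total = 0.0
        for _ in range(samples):         # teammates are stochastic: sample
            s = state
            for _ in range(horizon):
                a_h = predict_human(s)   # sampled human action
                s = step(s, a_r, a_h)    # task / state evolution
                total += reward(s)       # measure of team benefit
        return total / samples
    return max(robot_actions, key=score)
```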
Now let me quickly show you a few examples from my lab that demonstrate this. I have a few videos that won't play through these slides, but if you go to my website, arm.stanford.edu, under Research you can see some very cool videos for everything I'm telling you right now.

In this first problem, we were thinking about how a robot could substitute in and serve as a teammate alongside a human. The concept is shown on the left: we have a robot that ultimately wants to help carry a table with the human. In the thought cloud marked (a), the robot observes two humans working together — we expose it to extensive data of humans working together — and from that it says: I'm in a very similar situation to what I saw two humans in; what would a human do? It uses the statistical models we discussed to represent the range of possibilities it was exposed to, given a placement similar to what it's experiencing, and after planning over those potential outcomes, in part (c) it says: this is the outcome I'll pursue, because I think it maximizes my reward of getting close to the goal quickly without striking one of these red obstacles.

Shown on the right is the methodology by which my graduate student, Eley Ng, achieved this. The observation buffer holds all the human-human data we collected, where pairs of humans played this game together in a simulator. We compress this into a variational recurrent neural network (VRNN), which outputs motion predictions — expected paths similar to what humans demonstrated for us. We select what we believe to be the optimal path given distance traveled and other parameters, tell the robot to execute that path alongside a human, and observe the outcome of that shared trajectory. We showed in our work that if you compare this to RRT — which amounts to exploring random paths, making sure none of the rolled-out paths hit any obstacles, and eventually reaching the goal — our VRNN, the method informed by human demonstrations, produces much smoother outcome paths.

Additionally, you might say: when doing this task, maybe you followed an efficient path, but how do you know you were more collaborative than if you had picked random collision-free paths? One way to think about this is through something called interaction forces. Tug-of-war is literally a game of inducing stress in a rope, because there's disagreement: one team wants to pull the rope one way, the other team the other way, so there are very large tension forces inside the rope. But if I pick up an object in my hand, I'm applying some forces just to localize it; I'm not squeezing with all my strength, because there's agreement between my fingers about how I want to move the object. So one measure the field has agreed on is this notion of interaction forces: how much stress or compression are you applying to the object you're carrying, and how can that be used to measure consensus between the agents involved? What you see is that if two humans carry something together, the interaction forces are low — not zero, but low.
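For concreteness, here is one simple way such a measure could be computed for two grasp points on a rigid object — this decomposition (net force split evenly, with the leftover projected onto the grasp axis) is my own minimal formulation, not necessarily the exact metric used in the paper:

```python
import numpy as np

def interaction_force(f1, f2, p1, p2):
    """Signed squeeze/stretch between two grasp forces f1, f2 applied at
    points p1, p2 on a shared object. The net force (f1 + f2) moves the
    object; the leftover component, projected on the grasp axis, is the
    part that 'fights itself' -- the disagreement."""
    axis = (p2 - p1) / np.linalg.norm(p2 - p1)    # from grasp 1 toward grasp 2
    f1_internal = f1 - (f1 + f2) / 2.0            # agent 1's share of disagreement
    return float(np.dot(f1_internal, axis))       # >0 squeeze, <0 stretch (here)

# Tug-of-war: both pull outward along the axis -> large-magnitude interaction
p1, p2 = np.array([0.0, 0, 0]), np.array([1.0, 0, 0])
print(interaction_force(np.array([-50.0, 0, 0]), np.array([50.0, 0, 0]), p1, p2))
```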
If you use our VRNN, they're also pretty low. If instead you use the sort of random, collision-avoiding path planner, you'll see that these interaction forces — the tug-of-war — are quite high: it was hard for the robot and human to reach consensus.

Another really cool aspect here is the notion of a Turing test. We ran an experiment where we set up a barrier and had a participant work with a simulated partner on the carrying task over a bunch of trials. In some trials they worked with a robot, in others with a human, and we wanted to know whether they could correctly identify who they were working with based on the behavior they experienced as they played the game. What you see on the bottom right are the results from that experiment. Blue is the random path generator, which we don't expect to succeed very well; VRNN is our method, informed by watching humans work together. The x-axis compares what the participant thought against what it actually was: if they thought it was a human and it was a human, they got that correct; if they thought it was a robot given that it was a robot, correct again. What we saw was that our method did a decent job of confusing people when the VRNN system was the partner. This is a violin plot, which shows the full distribution of the data we collected, in case it wasn't unimodal and you'd like to see the true distribution.

My student extended this work — and I encourage you to check out her really cool video — by asking: what if we used an even more advanced method to capture and model behavior? For that we used a diffusion co-policy, and it turns out that when we exposed this diffusion co-policy to examples of human data and transferred it to the robot, the robot began to exhibit very interesting high-level behaviors. In particular, some that you'll see in her video include the ability to serve as a leader and then transition through a pivot to become a follower, so that the table could efficiently navigate the obstacles. When we saw that, it made us very excited.

Another project I'll quickly mention uses the same type of tools but applies them to a completely different domain: the intelligent prosthetic arm. In this instance, imagine someone has had a shoulder disarticulation, so they don't have their entire arm, and we want to endow them with either a prosthetic arm or a teleoperated arm so that they can do complex manipulation tasks. There are two ways to think about this: one where we know all the tasks they might want to do — drinking from cups, picking up a fork — and another where you need to teach them to do something novel: how do I teach you to move the hand in order to complete a task? It turns out the same framework applies. Given the tasks at hand, how do I extract intent from their gaze — if they're wearing a headset, what are they looking at? I could use EEG or intracortical signals to read their brain waves for an idea of what they want to engage with, or EMG — whatever muscle actuation they're able to provide. How can I condition on all of these inputs, limited as they may be, in order
to control a high-degree-of-freedom arm?

The last project I'll mention is a fall-prevention wearable sensor. In this domain we think about what might lead to imbalance in a person, and it turns out falls are a leading cause of fatal and non-fatal injuries in older adults — those 65 and older. Recognizing that falls can be caused by internal factors, like illness and psychological factors, but also by external ones — bumping into something, or something bumping into you — our question was: what can we do right now about those external factors, particularly about how people walk through an environment? Think of a passive perturbation: tripping over steps, bumping into a wall. Given how you've walked, and having watched a whole lot of people walk, can we predict which features of your environment might potentially lead to imbalance? My student built an initial wearable sensor that includes a camera mounted on the torso and IMUs located on the legs; the idea is that with these, he can observe the environment and use a machine learning model to make this prediction.

So now we're back to our problem statement, and while we'll use some fancy models to accomplish this, it's very useful to formally state what our state is and to represent our objectives in a probabilistic manner. Here, our state includes the pose of the person — where they are in space; the velocity of the torso — if the torso's motion has some natural sway and cadence, can we predict that sway and cadence and monitor changes in it, since a change might inform us about potential imbalance; and their step frequency and the joint angles of their legs. Two probabilistic statements follow: given how they've walked so far and the environment, can we predict where they will walk in the future? And given what we've seen so far and where they've walked, can we predict how the environment will change around them?

The first challenge was how to represent the environment in a way a neural network could digest and use for prediction. As a first step, we used the torso-mounted sensor to collect images and depth images of the surroundings, and we created a panorama. If you see the person in the middle of the screen at the bottom, there's a cylinder that encapsulates them; think of it as a cylindrical image where the red line can be cut and unrolled into the rectangular image you see in the top right. Whatever is right in front of you is a small area in the center of that image — the second-row image on the right — and if you're walking straight ahead, that's the red path drawn in the image, with your feet at the bottom and your path extending up in front of you, based on what you're able to see. This gave us an efficient way to represent the obstacles around us and provide that information to our model.

The other component is this notion of sway covariance. By watching the motion of the torso, we learned that if you take a z-axis pointing straight up from the torso and project its motion into the ground plane, there's a natural motion that traces out a sort of ellipse — a covariance. When someone is perturbed, this ellipse grows very sharply, and that growth is useful for predicting that something caused a perturbation — perhaps something that could lead to a fall.
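Here is a minimal sketch of one way such a measure could be computed — the windowing and the ellipse-area proxy are my own illustrative choices, assuming you have the projected torso-axis samples:

```python
import numpy as np

def sway_covariance(torso_up_xy):
    """torso_up_xy: (T, 2) array -- the torso 'up' axis projected into the
    ground plane over a sliding window. Returns the 2x2 covariance and an
    ellipse-area proxy whose sudden growth suggests a perturbation."""
    pts = np.asarray(torso_up_xy)
    cov = np.cov(pts, rowvar=False)              # 2x2 sway covariance
    area = np.pi * np.sqrt(max(float(np.linalg.det(cov)), 0.0))
    return cov, area
```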
This is the machine learning model we ultimately used to perform the prediction. Don't get overwhelmed by the details; I'll highlight what's important. We have the images, which produced the depth image of our surroundings as an input; it went through an autoencoder and was represented in a latent space — the network's efficient model of what was important in the environment as you walked through it. Then we had our state: the odometry of where you were walking, and the joint angles. These went through an LSTM — long short-term memory, one of those recurrent networks — which says: given everything that has happened up until now, how do I predict the next steps of the sequence, conditioned on the past? From this, it produced how it expected your walking to evolve, and how it expected your environment — your latent state — to evolve as well.

We walked through different places around campus and collected measures, over a prediction horizon of 7 seconds, of how well we could predict where you would end up. What we learned is that indoors, where the regions are tighter, we could predict within about two meters where you would end up. In these plots, the solid red line is the average and the shaded area is the covariance associated with it; the top row is indoor, the second row outdoor cluttered, and the third row outdoor free. The point I really want to drive home is that we could do a decent job of predicting where the person would go, but there was this variance — and the variance doesn't necessarily mean our model was wrong. It may just mean there were many solutions that could validly occur, and if we predict only one, we may not be exactly right, because there's a distribution of paths a person might walk that would all be valid choices.

A really cool feature — and again, I encourage you to go check out this video — was that we could do a decent job of predicting behavior in specific environments. Here is a picture inside our psychology building. In these sets of images, on either side of the dashed line you'll see a vision model and a no-vision model on the top row, for time segments A, B, C, D, E, and F. In the rows for the vision model, the blue line and its ellipses represent where you walked in the past and the sway covariance at those instants; the green line represents where you actually ended up, since we recorded all the data and post-processed it for validation; and the red line with its ellipses indicates where the model predicted you would end up, and what it predicted the person's sway covariance would be. What you'll notice is that without vision, the model just assumes inertia — if you're walking a particular way, it assumes you'll keep walking, because that's all it has access to. But if you incorporate conditioning on the environment and on past data of people walking, it begins to learn that people don't walk into walls, that you're likely to turn down walkways when they're available, that the way you orient your body predicts how you'll walk, and that your speed predicts how fast you'll walk in the future.
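Here is a schematic of the described pipeline in PyTorch — all the dimensions, the toy panorama encoder, and the prediction head are my own illustrative guesses, not the paper's values:

```python
import torch
import torch.nn as nn

class WalkPredictor(nn.Module):
    """Sketch: encode each depth panorama to a latent vector, concatenate
    with the walking state (odometry + joint angles), and run an LSTM that
    predicts the next state and the next environment latent."""
    def __init__(self, latent_dim=32, state_dim=16, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(            # toy panorama encoder
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, latent_dim))
        self.lstm = nn.LSTM(latent_dim + state_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, state_dim + latent_dim)

    def forward(self, panoramas, states):
        # panoramas: (B, T, 1, H, W); states: (B, T, state_dim)
        B, T = states.shape[:2]
        z = self.encoder(panoramas.flatten(0, 1)).view(B, T, -1)
        h, _ = self.lstm(torch.cat([z, states], dim=-1))
        return self.head(h)   # per-step prediction of the next (state, latent)
```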
Another really cool feature: as you walked through particular environments, there was a correlation in your sway covariance in those zones. Here's a picture of my graduate student walking around campus. What was interesting was that when he took sharp turns, his sway covariance changed in a particular way, and after being exposed to that type of data, the model was actually able to predict the motion of his torso given the path he was about to walk. Again, there are some really cool videos I encourage you to check out, where you can see in real time the motion of the torso and the sway covariance as someone walks through an environment.

Some future directions: that covariance wasn't perfect, so what if we wanted to predict multiple trajectories? And perhaps the sensor system we showed you isn't cool enough for people to actually wear — it would be better if it could be a smartphone. Down the line, we'd also like to help the populations that could use this — older people, or people with neurological disabilities. Could you outfit them with this sensor, collect their data, and make the system work for them? And perhaps use this in lower-limb exoskeleton control, where my lab is excited to collaborate with Professor Steve Collins, here in the mechanical engineering department at Stanford. We've since implemented this on a smartphone, and we're able to get really good depth-representation data. My student is also working to improve the estimation of sway covariance: we have software called 4D Humans that can extract a person's pose, so we can monitor their sway as they walk, and we can collect videos from the wild of instances where people fall and learn whether our parameter is actually useful in predicting whether those falls might occur. Finally, given that we might want to predict multiple paths and environments, we can use diffusion models to predict those multiple-hypothesis paths. Here we used the Segment Anything framework from Facebook to partition doors, walls, obstacles, and ground; then, watching the data of where people walked, we predict multiple hypotheses of where they may end up. You see two distributions of paths — some leading through this doorway, others making a turn — and both are valid outcomes.

The takeaway I want to leave you with today is that real-world human data can serve as a strong prior for teammate prediction through generative modeling, and the ability to predict a teammate can lead to more effective collaboration. I will end with that — thank you very much.

Monroe, thanks so much — this was a really interesting and very insightful presentation. You and your lab are doing some really great work, and it was great to have you share it with the audience today. Now let's get into some questions. Bill, did you want to kick us off? Then I'll take a look at what we have in the Q&A.
Well, we have a limited amount of time, so I'll just ask one question and then let's jump into the Q&A. Monroe, you mentioned some things about fall detection, and about folks who have disabilities. Can you talk a little more about what your lab is doing with these predictive models that might be useful for people who have limited mobility or other disabilities?

Yeah. Two of the projects I mentioned today were the intelligent prosthetic arm and the fall-prevention sensor, which you just mentioned. In both cases, the question we're asking is one of intention estimation. In the case of the intelligent prosthetic, the person does not have an ability and we want to endow them with it, and they do have the ability to give us some sort of input — but if you're trying to control these systems, the degrees of freedom you need to control are so high and so complex that it's very difficult to extract that low-level control from the person. So we can bring robotic autonomy, where a robot knows how to do a task, and couple that with ways of reading the person — estimating their intent — and then hopefully return some of that ability, some of that agency, for them to do particular tasks. In the case of the fall-prevention sensor, this is augmentation: we assume the person is capable, but they could use some help avoiding some unfortunate outcomes. Again, whether in fall prevention or in exoskeleton control, the idea is that you may not necessarily be replacing their ability, but your ability to assess and predict risk becomes useful in protecting or helping these individuals.

That's just fantastic. I have a sense from your talk that we're right on the edge of some amazing new capabilities to augment human capabilities, or to restore things that, for whatever reason — age or disease — have caused us to have a disability. Annette, do you have some questions from the audience? I'm sure they're full of questions.

I do — there are a ton, so I'll try to get to as many as we can. To kick us off: Professor Monroe, as we progress toward machines understanding human intention, what potential positive and negative impacts could this have on society, and how can we ensure responsible and ethical development in this domain?

It's a great question. I'd start with the positives by really emphasizing need. To date, the conversation around robotics and robotic autonomy is often concern about replacement, but the goal of robotics is to improve human life, and that doesn't always mean a robot replacing a particular task. If a task is very dangerous for people — if people are getting hurt or being mistreated — then perhaps a robot should replace those instances; but if that's not the case, then maybe augmentation is the answer: how do we make people more efficient? Think about examples in assisted living: you could have robots that work with people who want to live in their own homes longer — maybe they have small issues with dexterity, but they're cognitively all there, and they could hopefully live on their own a little longer. With the aging population worldwide, having these types of systems
that could help in that way would be extremely helpful. You can think of other instances: in agriculture, helping with sustainable food by having robots that work alongside farmers; the elderly population, as I mentioned; manufacturing, making sure it's humane and effective without diminishing output. These are all very good potential outcomes.

As with every engineering tool, ethics is important, and there are ways these things could be abused or misused. One that immediately comes to mind: at some point we want to make sure these solutions are democratized — they should be accessible to everybody. It would be great to say: I have this robot with all of these capabilities, and if I put it in an older person's home, maybe there's a flat rate for it, and there should be a level of functionality they should not be nickel-and-dimed for. You know — "oh, my robot can make tea, so you have to pay a $10-a-month subscription fee if you want it to make tea." If it knows how to make tea, it should make the person tea. Making sure the fundamental skills of these systems are democratized, so there's accessibility and usability, keeps us on track for a net positive effect. Again, these tools are very powerful — even more fundamental than the robots themselves, the power of prediction is itself an incredible thing: we predict the weather, and it lets us protect ourselves. But everything has the potential to be misused, so fundamentally it is our responsibility to make sure the solutions we develop are helping people and not harming them. At some level, with every powerful tool, it falls to the honesty of the developer to make sure it's used in responsible ways, and to society to hold people accountable when it's not.

That's great — thank you so much for those insights. I think we have time for about one more question. I know you touched on this a little in the presentation, but one of our participants is interested in expanding on it: what role do artificial intelligence and machine learning play in advancing the field of human-robot interaction specifically, and in enabling machines to understand and adapt to diverse human intentions?

That's a great question. I think machine learning has played this really fundamental role of allowing us to model stochastic processes. If you put an input in and expect a single output, that's an analytical description, which can be very useful where the rules are well defined. But data science is becoming very popular because the real world is often very hard to model, and very noisy. There's significant power in statistics — in being able to understand the distribution of particular outcomes — and machine learning and AI are the tools that sit on those statistical principles and allow us to predict those events with high likelihood. With that ability to predict, and to model things that are not analytical, we will see robots that are able to adapt to
uncertainty in their environment, and to adapt to the variability that exists. With that ability to adapt, we'll see their utility spike in a very significant way — and with that, I think we'll see real improvement in human life as these physical, embodied intelligent systems improve the world we live in.

A very interesting time indeed, and very exciting. Thank you again so much, Professor Monroe — this was really interesting and insightful, and I loved the conversation. To everyone in the audience with us live today, thank you for your questions and your participation. I want to remind you that today's session was recorded and a link will be sent out to you within a week. Please have a great rest of your day, and I'll see everyone next time.
2024-02-03