Hi everyone, my name is Saleema Amershi. I'm a Senior Principal Research Manager at Microsoft Research, and I'm here today with my colleague Mihaela Vorvoreanu, Director of UX Research and Responsible AI Education at Aether. Aether is Microsoft's advisory committee on AI, Ethics, and Effects in Engineering and Research. We're here today to talk to you about how you can create human-centered AI with the Human-AI eXperience Toolkit, or the HAX Toolkit for short.

Before I begin, I want to acknowledge the many researchers, practitioners, and interns who have contributed to the toolkit over the years. The toolkit has been a multi-year collaboration between Microsoft Research and Aether, and it has been about understanding challenges in building responsible, human-centered AI and developing solutions and tools to help.

In this tutorial, we're first going to give an overview of what human-centered AI is and why we need it. Then we'll briefly introduce the HAX Toolkit, which contains a suite of tools we've been developing to help AI creators and teams build AI technologies in a human-centered way. And finally, Mihaela will go through each tool in a little more detail, with a case study of a team at Microsoft that has been using the toolkit for their own AI-driven product.

Okay, so human-centered AI: what is it and why do we need it? First, some context. Here you see a map of the 2020 Data & AI Landscape, published every year by the venture capital firm FirstMark, and it goes to show the breadth and reach of AI in our everyday applications, services, and industries. We know that AI and machine learning are powerful technologies with the potential to enhance our capabilities and improve our lives. But as AI extends into every aspect of our lives, we're also starting to see more and more evidence of its potential to cause significant harm. Pick your headlines. On the left here, we see computer vision technology that has been used to help blind and low-vision people sense who and where people are in their environment; this is from Microsoft's Project Tokyo. But on the right, we see very similar technology used by law enforcement officers in New York, who used an undisclosed facial recognition system to find and harass a Black Lives Matter activist named Derrick Ingram, whom you see here.

It's this potential for harm that is causing society to demand change and accountability, particularly when AI is used in critical and life-changing scenarios. The industry is being called upon to reflect on and rethink how we build AI technologies to ensure we can do so in a responsible way. Now, there are many challenges to creating AI responsibly, including cultural challenges around shifting mindsets and embracing diversity throughout product development, and organizational challenges such as aligning business objectives with responsible AI objectives. Today, though, I'll focus on some of the technical challenges, meaning challenges around how we build AI technologies responsibly. When I say technical here, what I won't be talking about is algorithmic challenges or automated solutions for building AI. That's not because those aren't needed; it's because we're a long way from purely technical solutions that can automate all aspects of developing AI technologies, and we may never get there, because responsible AI still requires a lot of human judgment and careful decision-making throughout the process of building and deploying our AI technologies.
For example, it can involve making trade-offs between different values and goals when deciding what to optimize during model building. So instead, today I'm going to focus on how we can support the people building our AI technologies so that they can do so in a responsible way. This is where human-centeredness comes in: building AI responsibly requires that we adopt human-centered practices.

So what does that mean? People use this term in many different ways. I like to relate it back to its origins in human-centered design, which says that it's about ensuring that what we build benefits people and society, and that how we build it begins and ends with people in mind. Now, you might be thinking that sounds nice and warm and fuzzy, but how do we actually do this in practice?

There are some general best practices for building AI in a human-centered way. The first is thinking about end users, and thinking about them early, so that you can tie all technical decisions about your AI model and system back to users' goals and needs. What I mean by this is that human-centeredness prescribes doing the upfront research to understand people and the variety of contexts in which they may use your AI system, and using that understanding to make the decisions about the rest of the system. For example, if your AI-based technology is supposed to work for a broad range of people, then the data you collect should be representative of all of those people. Similarly, if the end users of your AI technology are going to need an explanation, it's important to choose a model that can provide one, or to choose an interpretable model. Another best practice is that if your AI is intended to work for a broad range of people and scenarios, it's important to involve diverse perspectives throughout development. This includes talking to diverse sets of potential users, but also involving diverse members of your own team, especially when making critical decisions about the system's functionality and capabilities. Human-centered AI also suggests that we anticipate and plan for different types of failures, so that the people who will ultimately use the system can recover when things inevitably go wrong.

These are just some best practices for human-centered AI, but they're easier said than done, and that is why we created the HAX Toolkit: to help operationalize these best practices. The HAX Toolkit, which you see here on the left and can reach at the URL on the screen, is a set of tools for creating responsible AI experiences. Our goal with the toolkit, as I said, is to operationalize human-centered best practices in our everyday work. We do this by creating tools that empower the interdisciplinary teams working to build AI, including model builders, engineers, designers, user researchers, and PMs. Each tool in the toolkit is grounded in explicit needs we've discovered through our engagement with product teams and has been tested with real product developers and engineering teams. These are research-based tools, so we're continuously evolving them as we learn and work with teams, to make sure they're usable and effective.
In the rest of this talk, I'm going to introduce you to some of the tools in the toolkit, which you can start using right away if you go to this website. Currently the toolkit includes tools that help with different parts of the AI development process, and more will be coming as we develop them. The first tool is the Guidelines for Human-AI Interaction, a set of best practices for how AI systems should behave during human interaction. Second is the HAX Workbook, a tool to guide teams through planning and implementing human-AI interaction best practices. The HAX Design Patterns are a set of flexible, reusable solutions to recurring human-AI interaction problems; they are about how to implement the best practices. And finally, the HAX Playbook is a tool for generating scenarios to test, based on likely human-AI interaction failures. I'm going to briefly go over each of these.

First, the Guidelines for Human-AI Interaction. The guidelines were released back in 2019, and they were a collaboration with people across Microsoft to synthesize and validate best practices for designing human interaction with an AI system. We created the guidelines because not only is AI extending into our everyday lives and technologies, it's also fundamentally changing how we interact with those technologies, for example by enabling new methods of interaction and new sensing capabilities. At the same time, AI is creating new challenges for people during interaction, and we see this in everything from humorous AI failures, like when our conversational agents misunderstand us, as on the left here, to dangerous situations when AI is used in high-stakes scenarios. On the right, we see a semi-autonomous vehicle that missed a fire truck stopped on the road, and in this case the driver wasn't able to intervene and recover in time. So we created the guidelines to help AI designers and developers design for safe and effective human interaction with AI.

I won't go into the details of how we created these, other than to say that we did four rounds of synthesis, iteration, and evaluation of the guidelines with more than 60 user experience professionals, and tested them against 20 real AI products. You can learn more about this in the paper, but the point is that we didn't just make these up; we endeavored to take a systematic, rigorous, research-based approach to developing the guidelines, so that we ourselves could feel confident in their effectiveness.

Here are the guidelines, which you see on screen. There are 18 of them, broken up into four categories, roughly based on when they apply as a user interacts with an AI system. These are rough categories, not hard assignments, but we group them this way to make them easier to remember. The first category, "initially," is all about setting expectations. Expectations matter because people often have unrealistic expectations about AI systems, given their complexity and how they're portrayed in the media, and unrealistic expectations can not only lead to frustration and product abandonment, they can also lead to over-reliance or under-reliance during interaction, which can be detrimental depending on the scenario of use. The next category is all about context, and context is important for AI because AI systems typically make inferences about people and their needs, and those needs depend on the user's environment, including their current task and attention, as well as the larger social and cultural context in which they're operating.
These guidelines describe considerations for designing your AI system to match the social and cultural contexts in which people will be using it. The third category is all about what to do when the AI is inevitably wrong, and I can't stress this enough: your AI will be wrong. If there's one thing I want you to take away from this tutorial, it's that your AI will be wrong, so it is very important to include mechanisms that reduce the cost to users when that happens. This set of guidelines outlines how to do that for common AI failures and errors. The final category is about what to do as the user interacts with the system over time. This, again, is important for AI because one of the key benefits of AI models is their ability to learn and improve over time, but that can also be disruptive if the interaction is not designed carefully.

I won't go into the details of the guidelines here; you'll learn about more of them throughout the rest of this talk. But teams have been using them throughout Microsoft and externally, and they say things like: the guidelines are really about how to create trustworthy assistive systems that have been given tremendous authority over human life, and we saw that from the beginning as our AIs extended into our everyday lives. Another person who has been using the guidelines on their team said: it took us four years to come up with 80% of these guidelines; if you can institutionalize this into the design parts of your products, you really give people an opportunity to build a much better product in gen 1.
And that's really what this is for. Now, while the guidelines are widely used, through our many workshops, talks, and engagements with teams we've continued to learn about challenges teams are facing, including how to make the guidelines part of people's everyday work; while the guidelines describe best practices, there's still work to be done to fit them into everyday work processes. In addition, as one person put it, a lot of the guidelines "are based more upon engineering than even design, and some would require full-scale overhaul of the back end." What that means is that if the guidelines aren't planned for early in development, it will be too late to introduce them at the end; if the spec doesn't have the guidelines built in, it's going to be too rigid to respond. These and other challenges we learned about during our engagements inspired several of the other tools in the toolkit.

For example, the HAX Workbook, which you see here, is a simple and flexible Excel workbook that helps guide early planning conversations around the guidelines. The workbook is intended to be used when you have an idea for a new user-facing AI-based feature and are starting to define your requirements, or when you have an existing user-facing AI feature or prototype that you want to evolve or improve. This is what the workbook looks like. We iteratively co-developed it with 43 practitioners over several sessions, to define the right breakdown and sequence of steps needed to plan for the AI user experience early and to accurately estimate the resources required.

The workbook guides teams through five steps. The first is identifying which guidelines are relevant, because, as I said before, there are 18 guidelines and not all of them will be relevant for every product scenario; step one is figuring out which ones apply to yours. In step two, you discuss the potential impact of each guideline on your end users. In step three, you define the implementation requirements. You then use both the understanding of user impact and the cost of implementation to prioritize which guidelines your team should implement. And finally, you can track the guidelines you've decided to implement in the workbook itself or in your own product development tools. Mihaela will go over the workbook in more detail later in this talk.
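To make the flow of those five steps concrete, here is a minimal sketch, in Python, of how a team might record and prioritize guidelines the way the workbook prompts them to. This is not part of the toolkit: the field names, example values, and priority heuristic are hypothetical, and in practice the prioritization in step four is a team discussion rather than a formula.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, simplified record of one row of the HAX Workbook.
# The real workbook is an Excel workbook; these names and the priority
# heuristic below are illustrative only.
@dataclass
class WorkbookEntry:
    guideline: str                  # e.g. "G2: Make clear how well the system can do what it can do"
    relevant: bool                  # Step 1: is this guideline relevant to the scenario?
    user_impact: str = "unknown"    # Step 2: "high" | "medium" | "low"
    implementation: str = ""        # Step 3: UI / AI / data / engineering requirements
    cost: str = "unknown"           # Step 3: "small" | "medium" | "large"
    priority: Optional[str] = None  # Step 4: e.g. "P0" = won't ship without it

def propose_priority(entry: WorkbookEntry) -> str:
    """Illustrative step-four heuristic: weigh user impact against implementation cost."""
    if not entry.relevant:
        return "N/A"
    if entry.user_impact == "high" and entry.cost in ("small", "medium"):
        return "P0"
    if entry.user_impact == "high":
        return "P1"
    return "P2"

# Example values echoing the case study discussed later in the talk:
g2 = WorkbookEntry(
    guideline="G2: Make clear how well the system can do what it can do",
    relevant=True,
    user_impact="high",
    implementation="Show a message that flagged searches may not always be accurate",
    cost="small",
)
g2.priority = propose_priority(g2)  # -> "P0"; step five is tracking this work item
```

The point of the sketch is only that each guideline carries its own impact estimate, implementation requirements, cost, and priority through the workbook; the real artifact is a shared worksheet, not code.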
I do want to highlight that step three, where you discuss possible implementations, is where you can bring the HAX Design Patterns into the conversation. We created the design patterns because there are often many ways to implement each guideline. Through our work with the guidelines and with teams, we've collected hundreds of examples of the guidelines being applied in everyday products, and we worked through those examples to distill a set of patterns that abstract away the application-specific details and describe flexible solutions to recurring human-AI experience problems that people can apply to new scenarios. We currently have about 33 design patterns for eight of the guidelines, which you see here, and we're continuing to develop more. All of these patterns, along with examples from real AI products, are available in the HAX Toolkit in our HAX Design Library. The Design Library is something you can filter and browse to see how the guidelines are being applied in different product scenarios and to see which patterns might be most effective for your own. We've also made the Design Library extensible, so people in the community can contribute new examples or patterns as we all learn and grow in how to design effective human interaction with AI.

The final tool I'll introduce today is the HAX Playbook, which was designed for generating scenarios to test based on likely human-AI interaction failures. We created this tool because, as we worked with product teams, we found it was very hard for them to anticipate the different ways an AI may fail at any given moment. AI systems are probabilistic and designed to generalize to new scenarios, so it can be hard to anticipate when and what will go wrong. What the Playbook does is help teams proactively and systematically explore common AI failure scenarios so that they can go beyond the golden path. What you see here on the right is a tool where you walk through and answer a set of questions about your product scenario, and the system then generates different potential failure scenarios that you can design for and test.

Each of these tools is available to use now at the URL you see on screen. And with that, I'm going to hand things over to my co-lead Mihaela, who will walk through using these tools in a real case study from a team at Microsoft.

Thanks. All right, let's take a look at a case study to see how an actual feature in an app released by Microsoft used the HAX Toolkit to improve the user experience and make it a responsible one. We'll be talking about the Microsoft Family Safety app. This is part of Modern Life Experiences, and the app is meant to promote digital safety and well-being: a family admin can set limits for browsing and apps, filter content, and avoid surprise spending. The app has a feature called flagged search that flags potentially toxic search terms so that parents or guardians can discuss those searches with their children. These are some early mock-ups of the app, and you can see how initially it would highlight search terms that were classified as toxic or concerning.

The Modern Life Experiences team worked with us proactively to mitigate potential harms. At Microsoft we have a process for products that are considered to potentially pose risks of harm, including psychological harms, to undergo review and consulting, and that is how we met the team and came to work with them on this app. As you can imagine, there could be several psychological harms to children, for example if they are embarrassed or their privacy is invaded through some of the searches they do, especially if you think about them trying to discover their own identity. So this was a sensitive topic, and we were happy to have the chance to work with this team to mitigate some of these harms.

The team filled out the HAX Workbook over two or three sessions in which they worked together. It's important to point out two things. One, it's really useful if several members of the team from different disciplines are in the same room working on the workbook together, because each guideline may have implications not only for UI and UX but also for engineering and data science.
Two, it's also ideal that the team was in the very early stages of developing this feature; that is a great point at which to stop and plan for implementing the Guidelines for Human-AI Interaction.

Let me show you, as an example, their responses in the worksheet for Guideline 2, "Make clear how well the system can do what it can do." This guideline is about letting users know that the system is not perfect and might make mistakes. The team deemed this guideline relevant to the flagged search feature, and they thought that applying it might benefit users by making sure they don't blindly trust the insights: if they see a flagged search, they would think, do some research, and be careful about taking actions, like talking to their children, based on it. Conversely, not applying this guideline might harm users, because parents could over-rely on the feature, miss unflagged searches, or decide to take extreme measures. Taking the impact of both applying and not applying the guideline into consideration, the team estimated that the impact on the user for this guideline would be high.

Moving on to the next step in the workbook, the team had to describe requirements for how they might implement it, covering UI, AI, data, and engineering. In this case the team decided that showing a message communicating to the user that the feature might not always be accurate was the way to go, and they estimated the resource commitment for this implementation as small. Taking all of these decisions into consideration, the team prioritized Guideline 2 as a P0, meaning they would not ship the feature without it being implemented.

When we think about how to implement a guideline, we can go back to the HAX Toolkit and look at the HAX Design Library for ideas. The HAX Design Library includes design patterns for some of the guidelines, outlining established solutions to common problems, and it also includes concrete examples from real products. And of course, what's very exciting for us is that anybody can contribute new patterns and examples. So let's take a look at how a team might implement a guideline and how they might use the HAX Design Library for help. I'm going to switch over to the website and look at Guidelines 1 and 2, which were both considered P0 by the team.

Here we are on the HAX Toolkit website. Since we're interested in ideas for implementing the guidelines, we're going to hop over to the HAX Design Library. In the Design Library, as I mentioned, we have patterns and examples for the guidelines. In this case we'll narrow it down to Guidelines 1 and 2, since they're related and both were considered P0 by the team, and we'll also narrow it down to just patterns, which will give us ideas for how we could implement those guidelines. If we take a look, we can immediately see a pattern that looks interesting: writing an introductory blurb. If we keep scrolling, we can see other patterns for making clear what the system can do (Guideline 1), as well as patterns for implementing Guideline 2, for example this one, where we use precision in our language and try to match it to the system's performance. Let's take a very quick look at this pattern. We can see that each pattern has a problem, a solution, when to use it, how to use it, common pitfalls, and more.
Here we can see that one way to communicate that the system may make mistakes is through the intentional use of uncertainty in language. We can also look at examples of how this pattern has been implemented in other systems; for example, here is a screenshot from LinkedIn, which sets expectations for how well the system can do what it can do by using tentative language, such as recommending "people you may know."

So now we have a couple of interesting candidate patterns. Let's go back and see how the team implemented Guidelines 1 and 2, as well as the other guidelines they selected as high priority. If you remember, there are 18 guidelines; the team answered the same workbook questions for all the guidelines they deemed relevant to begin with, and by the end of the workbook they had decided that these guidelines were high priority for the feature. Let's take a look at what they did and how they ended up implementing them.

The first thing the team did to implement Guidelines 1 and 2 was to use the introductory blurb pattern I showed, combining it with making clear how well the system can do what it can do. They included a carousel-style tutorial for search activity, and we're going to go through it. In the first tab, they explain what the feature does and how it flags potentially concerning search terms; note the cautious language, "potentially concerning search terms." It also delimits what the system can do by saying it works with search engines like Bing or Google in Microsoft Edge, and only in the English language. The second screen of the carousel shows further information about search terms, explaining that a search term will be flagged if it falls into categories like adult content, cyberbullying, hate speech, violence, or mental health. In the third tab of the carousel, the system introduces the lookup feature, which I'll address a little later; it is also a way of implementing Guideline 11, "Make clear why the system did what it did," and it shows that a parent can learn more about a search term if they don't immediately understand why it might have been flagged. As those of you who are parents know, it is sometimes hard to understand the language and the terms that children, tweens, or teenagers use, so this lookup feature can help clarify things. Finally, in the last tab of the carousel, they also implemented Guideline 15, "Encourage granular feedback," and Guideline 16, "Convey the consequences of user actions," by reminding users that they can let the system know how it's doing. Once again, notice the humble language: "We're still learning, and sometimes we don't get it right. Tap on an item to teach us if it should be flagged or not and help us improve this feature."

I'll also show a little of what the team did with some of the other guidelines, such as Guideline 6, "Mitigate social biases." For this guideline, we spent a lot of time thinking about what language to use, and we decided that even though the team had been using the word "toxic," that could be too strong and problematic in certain situations, so we decided to just use "flagged." The team also did some work to reduce biases in their dataset and models. And after a lengthy, very interesting, and heartfelt discussion, we decided to exclude terms related to sexuality. We imagined a scenario in which a child might be exploring their own sexuality and want privacy from their parents, or in which the parents might not be very open-minded, and so we decided that those kinds of searches should not be flagged at all, in order to protect children as they discover their own identity.
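As a purely illustrative aside, a policy decision like this often ends up as an explicit exclusion step applied on top of whatever the classifier predicts. The sketch below is not the team's code; the category names and the function are hypothetical.

```python
# Hypothetical sketch: categories a team has decided never to surface,
# regardless of the classifier's prediction (here, searches related to sexuality).
EXCLUDED_CATEGORIES = {"sexuality"}

def should_flag(predicted_categories: set[str]) -> bool:
    """Apply the exclusion policy before deciding whether to flag a search."""
    return bool(predicted_categories - EXCLUDED_CATEGORIES)

print(should_flag({"sexuality"}))      # False: only an excluded category, never flagged
print(should_flag({"cyberbullying"}))  # True: a non-excluded concerning category
```

Encoding the exclusion as an explicit rule, rather than relying on the model alone, makes a sensitive policy decision like this one easy to audit and to keep regardless of how the model changes.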
Guideline 10, "Scope services when in doubt," deals with how the system handles its own uncertainty when it's not sure about the user's goals, and Guideline 11, "Make clear why the system did what it did," was also high priority. To address these guidelines, the team added certainty thresholds and calibrated them so that flagged search items would only be shown when the certainty was high. They also added screens to explain uncertainty, and they added the lookup feature. So if a term is flagged and a parent doesn't understand why, there is an explanation: we flagged it because we think the search falls under one or more of the categories I mentioned before. If that explanation is not sufficient, the parent has the option to look up the term to try to figure out what it means and whether it is indeed a reason for concern. And of course, as you can see on the right-hand side, there's a way to provide feedback. Looking at it more closely, a parent can tap on an item and indicate whether it should or should not be flagged, which is an implementation of Guideline 15. The system then learns from this behavior, and it communicates to the user that it's learning by applying Guideline 16, "Convey the consequences of user actions," and showing a message about how the feedback will be used.

Now let's think a little about the HAX Playbook. The HAX Playbook was not available at the time the team was developing this feature, but we can look at how the team could have used it to improve the user experience. So I'm going to switch over and show you the HAX Playbook quickly. I'm back on the HAX Toolkit website; I'll click on the HAX Playbook, launch it, and answer its questions from the perspective of this feature. This system is a classification system (flag or don't flag a search), the primary input modality is text, and it's very clear when it should trigger and how it's delimited. There could be multiple valid interpretations of the user's input, as we saw with the "weed killer" example, and the classification is binary. As I check these boxes on the left-hand side, I get on the right-hand side a number of scenarios that I can look at and plan for, knowing that these might be possible errors. Spelling errors resulting in wrong input might not be of the highest concern in this case, but if we look at response generation errors, we see that there could be ambiguity (fortunately, the lookup feature deals with that), and we see that false negatives and false positives could also be of concern. These are scenarios for the team to think about, anticipating that the system is likely to produce these errors, and then to prototype, test, and make sure they account for them in the application.
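To show how scenarios like these might be carried into testing, here is a minimal, hypothetical sketch of a confidence-thresholded flagging decision (echoing the calibrated certainty thresholds described above) with simple checks for the false-positive and false-negative cases the Playbook surfaces. None of this is the team's actual code; the names, the threshold value, and the classifier output shape are assumptions for illustration.

```python
from typing import NamedTuple

class Prediction(NamedTuple):
    category: str      # e.g. "violence", or "none" when nothing concerning is predicted
    confidence: float  # the model's certainty, in [0, 1]

# Hypothetical calibrated threshold: only flag when certainty is high.
FLAG_THRESHOLD = 0.9

def flag_search(prediction: Prediction) -> bool:
    """Flag a search only if a concerning category is predicted with high certainty."""
    return prediction.category != "none" and prediction.confidence >= FLAG_THRESHOLD

# Playbook-style scenarios written as simple checks:
# a low-certainty prediction stays unflagged (reducing false positives)...
assert flag_search(Prediction("violence", 0.55)) is False
# ...but that same choice creates a false-negative risk, so teams also need
# test cases where a truly concerning search arrives with low model certainty.
assert flag_search(Prediction("violence", 0.97)) is True
```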
Now that we've quickly seen how we could use the HAX Playbook, let me turn back to the application for a second and continue the presentation. Coming back to the app: the Microsoft Family Safety app was launched and is now available, the flagged search feature is live, and the team told us that they really benefited from using the HAX Workbook, and that they really enjoyed it too. In fact, the team's data science manager told us that they've integrated the workbook into their AI processes and will continue using it for other features they work on in the future.

We also have a bit more feedback from teams who have used the HAX Toolkit. They have found it to be a team alignment tool that puts everybody on the same page, and we know how important that is, because creating responsible, human-centered AI requires deep engagement and collaboration across disciplines, which tends to be very difficult. The HAX Toolkit helped people find lots of points they might not have considered before, and it gives people an opportunity to build a much better product in gen 1. In other words, a little planning ahead of time, a little time spent working through the workbook, can really save time in the long run and create a much better product. The Playbook can standardize error-case design, and the guidelines serve as research evidence for design decisions. We were very happy to hear this from teams, but we would also like to hear from you.

Just as a reminder, these are the four tools that are in the HAX Toolkit right now. We are hoping to add more, and we hope to keep contributing to Microsoft's holistic approach to responsible AI, which includes not only principles but also tangible tools and practices for data science and engineering as well as for UX design, user research, and PM. In fact, we have a web page with responsible AI resources that I encourage you to visit, where you will also find a collection of research papers published by Microsoft on the topic of responsible AI.

Speaking of you contributing: we invite you to use the HAX Toolkit when you are working on user-facing AI, to teach it if you are a faculty member, to submit examples and patterns to the Design Library, and of course to engage in research that extends these tools to new scenarios or contexts, as well as building more tools for parts of the design process that we haven't yet addressed. We are very much looking forward to hearing from you and working with you, and you can use the email address haxtoolkit@microsoft.com to contact Saleema and me. Thank you so much, and we look forward to hearing from you.
2022-02-11