Evaluating the Equitability of Commercial Face Recognition Technology in DHS Scenarios
- All right, good afternoon. My name is Arun Vemury with the DHS Science and Technology Directorate's Biometric and Identity Technology Center. I'd like to thank you for joining us today for this webinar on Evaluating the Equitability of Commercial Facial Recognition Technologies in DHS Scenarios.
This is part of our webinar series where we are providing information about some of the work we're doing, specifically in the area of assessing the accuracy, fairness, and equitability of biometric technologies and matching algorithms. So with that, I'd like to also let you know that we're joined by two of my colleagues, Dr. Yevgeniy Sirotin and Dr. John Howard, who will be briefing you today on some of the work that we're doing.
Let's go ahead to the next slide, please. Alright, we'll provide you a quick overview of some of the things that we're gonna be talking about today. We'll provide you an introduction to the Biometric and Identity Technology Center, talk about the Biometric Technology Rallies and how that data collection and that activity leads to data and helps inform some of our research, test and evaluation activities, and the work that we're doing on standards to help inform broad international engagement on measuring the effectiveness of these technologies, not only in terms of raw biometric performance but also, again, fairness and equitability. And we'll also talk about some of the work that we've been doing in 2021 with our most recent Biometric Technology Rally to help take a look at how industry has progressed over the last year. Next slide. Alright, so within S&T's technology centers division, we have the Biometric and Identity Technology Center.
And what we do is core foundational research into topics related to biometrics and digital identity. Our goal is to help drive innovation throughout S&T and the DHS components and headquarters agencies through research, development, test, and evaluation. Our intention is to help facilitate better understanding of lessons learned and help people understand how technologies are continuing to evolve and provide greater transparency and understanding for DHS components who are trying to have a better understanding of how these technologies may be useful for their specific operations. Our goal is also to help drive efficiencies. If we find that technologies are working well for some components or that there are best practices to be shared, it's our intention to help make sure that that knowledge is shared across different components and missions.
We provide objective subject matter expertise across the enterprise, not just one component or a mission, but make sure that that's broadly available. And we work actively with industry and academia to provide not only a better understanding of where we have technology needs and gaps, but also to spur innovation and to help, you know, provide mechanisms to evaluate and provide feedback so that they can make better technologies over time. And with that, let's go on to the next slide. I'll kick it over to Dr. Sirotin,
who will provide a background on the Biometric and Identity Technology Center Technology Rallies, and talk a little bit more about how that feeds some of our supporting research. Thank you. - Thanks, Arun. DHS S&T created the Biometric Technology Rallies to motivate industry to provide innovative biometric technology solutions focused on DHS technology use cases. Specifically, the rallies were designed to address key technology risks outlined on the left.
We believe these risks are relevant across a variety of biometric technology use cases, many of which will be discussed at this webinar. These risks include effectiveness risks, or high failure rates, efficiency risks, or technologies that are too slow or require excessive staff, risks due to the satisfaction of the users of the technology leading to potentially low adoption or just unhappy users, and, of course, risks to privacy, whether PII gathered by these systems is stored securely. And each Biometric Technology Rally is carefully designed to focus on a specific biometric technology use case. Oh, and of course, the subject of this webinar is the equitability risk, which focuses on ensuring technology works for everyone.
So each Biometric Technology Rally is carefully designed to focus on a specific biometric technology use case and provides independent quantitative assessment of current industry offerings. And the rallies really help DHS collaborate with industry, and they do that through cooperative research and development agreements, which are entered into between DHS and technology providers so that they can share information and make their technologies work better within the DHS scenario. Going on to the next slide, we see that here is a little bit of a primer on scenario testing versus technology testing. So the Biometric Technology Rallies are specifically scenario tests, and as such, they fill a specific need in testing biometric technologies that's kind of unique.
Scenario tests are laboratory evaluations that fall in between sort of the operational testing like pilot deployments on one side and technology tests that are done in computer labs on the other. And so what I'd like to do is highlight the difference between technology tests, like, for example, NIST's FRVT tests that folks are familiar with, and scenario testing like the Biometric Technology Rallies. So technology testing focuses on a specific biometric technology component, for example, a matching algorithm in isolation, whereas scenario tests, on the other hand, are centered around a specific technology use case, for instance, a high-throughput airport checkpoint, and they include the full multi-component biometric system, so everything from user interaction, camera location, and, of course, biometric matching algorithms. So technology tests generally reuse biometric datasets and images that have been collected in the past and they benefit from these larger sample sizes, whereas scenario testing, by contrast, gathers all new biometric data each time in a way that simulates the operational environment but consequently, we work with a smaller sample size in this case. So what's important here though is technology testing answers different questions than scenario testing.
So technology testing answers questions about how technologies advance or perform relative to each other, especially at the limits of performance. So sort of think racing cars along the Bonneville Salt Flats, you know, the kind of cars that people race there to see who can beat the world land speed record are very different than the kind of cars you drive around town. So scenario testing answers questions about how well the technology performs within an intended use case. And in this case, you could think of, you know, if you're driving your car around town versus along a dirt trail versus some other specific scenario where you need to get from point A to point B. So the scenario test is really tailored to answering questions around that track and not in principle. So technology testing also answers questions like, for biometrics, what is the minimum false match rate achievable by face recognition technology? Whereas scenario testing answers questions like how will face recognition perform in, say, a high-throughput unattended scenario, like at an airport? So the work we perform at the Maryland Test Facility involves testing different types of biometric technologies which include, of course, face recognition.
And the main test we perform, the rally, is focused on assessing a multitude of commercial face recognition and multimodal systems in DHS use cases. So we've been running the rally since 2018 and the most recent assessment was carried out just a few months ago. To date, we've tested more than 200 combinations of commercial face acquisition systems and matching algorithms in this high-throughput unattended use case that we've been simulating through these years. And these rallies have provided some really comprehensive metrics about these tested technologies, which include, you know, how quickly they work, their efficiency and transaction times, the effectiveness of these technologies, you know, the ability of them to reliably acquire images and match them, satisfaction, you know, the user feedback that people leave about these technologies, as well as more recently, and the focus of the 2021 rally, is the equitability, making sure that technology works well for different demographic groups.
And a lot of this work you could find at mdtf.org. So in addition to the summative metrics of technology performance, DHS S&T has used the data gathered as part of the rallies to help answer important questions about the way that commercial biometric technologies work, including questions regarding whether the technology is equitable, fair, or biased through advanced data analyses and publications in scientific journals. And I have a few of them on the right side of the slide here. And our publications to date have addressed a number of research topics, which include, for example, looking at the role of image acquisition in shaping demographic differences in face recognition system performance, establishing the influence of race, gender, and age on the false match rate estimates of face recognition systems, something we'll touch on today, quantification and comparison of race and gender differences in commercial face and iris recognition systems, as well as cognitive biases introduced by face recognition algorithm outcomes in human workflows. So while some systems test well with diverse demographic groups, there are some demographic performance differentials that persist in both acquisition and in matching components of biometric systems, and these require careful evaluation. So I'll give you some examples today and so will John later on in this webinar.
So specifically, what I'll start with is data from last year's, the 2020 Biometric Technology Rally, which was the first rally completed during the ongoing COVID-19 national emergency. So as the emergency unfolded from February and into the fall of 2020, masks became a part of life in the travel environment, and removing masks for face recognition now became a potential new source of risk to unvaccinated travelers and to staff at the airport. So for the 2020 rally, we challenged the industry to provide face recognition technologies that work in the presence of face masks.
And this rally was the first large-scale scenario test of such technologies and we compared how well they worked for individuals without masks and the same people wearing their face masks of choice using the technology, and all of this while simulating a high-throughput unattended scenario environment. So what do I mean when I say an unattended high-throughput scenario? And I think I alluded to this a few times before, kind of like an airport checkpoint, but here are the main properties of this scenario. One, the face recognition system had a limited time to operate, actually just eight seconds on average per person, the face recognition system gets to acquire just about one image per individual, you can't acquire 10, and the identification gallery that you're working with is small, you know, typically you wanna identify people boarding a particular aircraft, 500 people. Most people being matched are in the identification gallery as well so there's very few people that would be out of gallery in this case, people who are not on the plane.
And consequently, the impact of errors of those being matched is dominated by one kind of error. It's called a false negative error or false nonmatch, and the consequence of having a false nonmatch is a delay or denial of access to an aircraft. So in this case, that's what I'm gonna focus on. In the later part of the talk, Dr. Howard will talk about the other type of biometric error. So in this rally, the 2020 rally, a total of 582 diverse volunteers participated, and I show you sort of a demographic breakdown here by age, race, and gender.
It's a complex graphic but what it conveys is that we had people that participated in this rally come from all sorts of demographic backgrounds, all ages, 18 to 65, males, females, and folks from different race groups. All of this demographic data is self-identified by the volunteers. So there were some volunteers that self-identified as Black or African American, volunteers that self-identified as white, Asian, and, you know, a number of other groups for whom we had a limited sample. Throughout the testing, volunteers used their own personal face masks. And in this rally, six commercial image acquisition systems participated and 10 commercial matching systems participated for a total of 60 system combinations tested.
You see, in the rally, we test different acquisition systems with different matching systems and we're able to see a whole variety of performance. So all of these systems had to acquire a face image from each volunteer, and then that face image was used to identify each volunteer against a small gallery. And so what did we see? So the first part of the rally tested these systems without face masks, so everybody was wearing their face mask but right before biometric acquisition, we asked them to take their mask off so that they could go through the biometric system.
And the infographic that I have here on the left shows the overall performance of the median system without face masks. So the median system actually did quite well because face recognition has come a long way in the last decade. And the graphic on the left shows that overall, the median system was able to identify 93% of these 582 volunteers. So overall, you can see there were few errors due to matching. Just 1% of the errors were due to the matching system failing to identify based on a collected photo. But more numerous were issues with image acquisition, so where the camera failed to take a photo.
That was for 6% of the individuals in our sample. So overall, this is well in line with what we typically see in these scenario tests is that actually algorithms have gotten very accurate and they now don't dominate the errors of the biometric system. A lot of the errors are now made by the cameras. And on the right, I'm showing you something that we call the disaggregated performance of the system across demographic groups. So on the X-axis, I have the different demographic groups, Black, white, Asian, and other, and each point on this graph represents the true identification rate for a given system combination, there are 60 total, across our sample for each one of the demographic groups.
And you could see that this true identification rate is very high for each one of the systems tested. So for any demographic group tested, I could find a number of systems, and that's the number above these colored plots, a number of systems that have performed, you know, within the acceptable range, and that's that gray bar on the graph. So you could see 22 systems performed well within the acceptable range for volunteers that identified as Black or African American, 32 systems for volunteers who self-identified as white, 25 systems for those that self-identified as Asian, and 12 for other. And this TIR is actually inclusive of all the sources of errors, so failures to acquire and failures to match.
And on the very right side, I have another type of TIR which we call Matching TIR, and that one just focuses on algorithm errors. And you can see that when you take away errors of acquisition, essentially all the systems were able to, you know, most of the systems, were able to meet this high performance bar across demographic groups. And of course there are some systems that just didn't perform very well for technical reasons, and those are the dots that you see at the very bottom. But overall, the results look very similar across demographic groups.
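As a rough illustration of the two rates described here, the following is a minimal Python sketch, not the rally's actual analysis code. It assumes each attempt is recorded with a self-identified group label, whether an image was acquired, and whether that image was correctly identified; the `Transaction` record and the function names are hypothetical. TIR counts every attempt, so failures to acquire and failures to match both lower it, while Matching TIR only looks at attempts where an image was acquired.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Transaction:
    """One volunteer's attempt with one system combination (hypothetical record)."""
    group: str      # self-identified demographic group
    acquired: bool  # the acquisition system captured a face image
    matched: bool   # the matching system correctly identified that image


def tir(transactions):
    """True identification rate over all attempts (acquisition and matching errors count)."""
    if not transactions:
        return float("nan")
    hits = sum(1 for t in transactions if t.acquired and t.matched)
    return hits / len(transactions)


def matching_tir(transactions):
    """True identification rate over acquired images only (matching errors count)."""
    acquired = [t for t in transactions if t.acquired]
    if not acquired:
        return float("nan")
    return sum(1 for t in acquired if t.matched) / len(acquired)


def disaggregate(transactions, metric):
    """Apply a metric separately to each demographic group."""
    groups = defaultdict(list)
    for t in transactions:
        groups[t.group].append(t)
    return {group: metric(ts) for group, ts in groups.items()}


# Made-up example data: 10 attempts in one group, 1 failure to acquire, 1 failure to match.
example = (
    [Transaction("group_a", True, True)] * 8
    + [Transaction("group_a", True, False)]
    + [Transaction("group_a", False, False)]
)
example_tir = disaggregate(example, tir)
example_matching_tir = disaggregate(example, matching_tir)
```

With this made-up data, the failure to acquire lowers TIR but not Matching TIR, which is the distinction between the two plots described above.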
So what happens when we had folks keep wearing their face masks? So obviously a lot of these biometric systems were redesigned very recently in order to be able to even handle images with face masks. So we asked, okay, well, what impact now does that have on the performance of face recognition? And just to give you an idea, these images here are the diversity of the kind of images gathered by our system and the diversity of the face masks that we had in this evaluation. You could see all sorts of masks here.
You could see blue surgical masks, you could see a lot of personal masks, masks with patterns, masks that are dark or light, different colors and different patterns as well. So this was really a good representative variety of the kind of face masks that people wear within the travel environment. So what did things look like? So again, on the left, I'm showing the performance of the median system, and indeed the performance with face masks was lower, with the median system now identifying only 77% of the 582 volunteers. Remember the previous slide, it was 93%. But impressively, the best system combination was still able to identify 96% of all volunteers. Again, you know, now the errors were a little bit higher so the algorithm failed to identify 8% in the median system and the camera failed to take a photo on 14% for the median system, so again, a lot of problems with even acquiring an image and they're exacerbated by face masks.
But what I'm showing you on the right is, again, this disaggregated performance, I'm showing you this true identification rate. And I have an arrow marking an observation that we found worth highlighting, is when we disaggregated the system performance in the presence of masks, the results look very different. So of course overall performance went down but performance for some demographic groups went down more than for others.
So, for instance, the performance for individuals that self-identified as Black or African American was particularly lower such that now no system combination achieved acceptable performance, this gray band, for that demographic group. Whereas for white, you see that five systems met that criterion, for Asian, 15, and then for other, eight. And you could see that this persists even if you discount any failures to acquire, so on the very right, it's the same plot, but now on the Matching TIR, fewer systems met the matching only criteria, you know, 7 for Black or African American versus above 26 for all other groups. So what this shows is that face masks not only decreased face recognition performance overall, but they also unmasked some demographic differentials, which we didn't see when the faces were not masked, when people were taking their masks off. So let's look at the performance of this best-performing acquisition and matching system combination, which performed relatively well, as I said earlier.
And you could see that without masks, this system actually matched every single person. It didn't make matching errors. So the best system combination really worked well for everyone and you could see that without masks on the left. But with masks, on the right, you could see that even this best system combination failed to reach 95% true identification rate for volunteers identifying as Black or African American.
So this sort of sets the upper bound of what was possible as demonstrated by this rally. So what have I told you so far? Well, face recognition technology can work well across demographic groups, especially without face masks. And these findings are similar to the findings from past rally scenario tests, but acquisition and matching errors don't always increase equally when the system is perturbed, in this case, by the addition of face masks. So if a system were to be evaluated for demographic differentials without face masks, you'd say that most of these systems did pretty well. However, as conditions change, like when people started wearing face masks, then these perturbations can actually unmask some demographic differentials that exist in the systems but just aren't visible without. So we found that this performance can decline for some demographic groups more than for others.
Both acquisition and matching performance was affected. It's not just the matching algorithm that's important. We believe that there's a lot of effects of acquisition cameras that contribute to this, and that future research is gonna be really needed to investigate differential performance of the technology that underlies these differential outcomes. So the takeaway here is a fair system under one set of operational conditions may become a little bit demographically unfair when conditions change.
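The comparison described here can be sketched in Python as follows. This is only an illustration under stated assumptions: the 0.95 threshold is borrowed from the 95% figure mentioned for the best system and is not confirmed as the rally's acceptance criterion, the function names are hypothetical, and all numbers are made up rather than rally results.

```python
def systems_meeting_bar(tir_by_system, threshold=0.95):
    """Count, per demographic group, how many system combinations reach the threshold.

    tir_by_system maps a system name to a {group: TIR} dictionary.
    The 0.95 default is an assumption, not a documented rally criterion.
    """
    counts = {}
    for rates in tir_by_system.values():
        for group, rate in rates.items():
            counts[group] = counts.get(group, 0) + (1 if rate >= threshold else 0)
    return counts


def differential(baseline, perturbed):
    """Per-group drop in TIR when conditions change (for example, masks added)."""
    return {g: baseline[g] - perturbed[g] for g in baseline if g in perturbed}


# Made-up TIR values for two hypothetical system combinations under masked conditions.
with_masks = {
    "system_1": {"group_a": 0.96, "group_b": 0.99},
    "system_2": {"group_a": 0.90, "group_b": 0.97},
}
counts = systems_meeting_bar(with_masks)

# Made-up per-group TIR for one system, without and with masks.
no_mask = {"group_a": 0.99, "group_b": 0.99}
mask = {"group_a": 0.90, "group_b": 0.96}
drops = differential(no_mask, mask)
```

In this toy case both groups look the same without masks, and a gap only appears once the second condition is measured, which is why the evaluation has to be repeated as conditions change.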
And so from this, we recommend ongoing testing to track system performance and to include fairness as part of that as conditions change. So at this point, I'm gonna hand things off to my colleague, Dr. John Howard, who's gonna talk about the other kind of biometric error. So everything I've told you so far has been about false negatives and what John is gonna talk about right now is a different kind of error, false positives, which has a different kind of demographic effect.
And John, I'm gonna hand it off to you now. - Okay. Just a head nod from someone you can hear me. This is good. Excellent.
Okay, so yeah, biometrics and what we call demographic equitability, this is a topic that we found ourselves very heavily involved in in sort of the last couple of years. And it sort of means, you know, how well do biometric systems work across different groups of people, right? And this can be a lot of different things. It could be white/black, it could be short/tall, it could be light skin/dark skin, it could be male/female, but one of our goals is to sort of evaluate and to encourage industry to make biometric systems that sort of work equally well for all these different groups of people, and so that's what this topic we're gonna look at today is. You can go to the next slide. Right, so this may be a topic that some of you are familiar with. It's actually been in the news a lot lately over the last couple of years and in some fairly prominent places.
We had articles in places like "Nature" that you see there on the upper left, which is, you know, a leading scientific publication that sort of asked the question, is facial recognition too biased to be let loose? We had some material in "MIT Technology Review," another very prominent tech reporting outlet that said, you know, a US government study actually confirmed in their words that face recognition systems were racist. And then the other thing I'll point out on this slide is it's not just a US issue. There's been a lot of activity in Europe and especially with regulatory bodies asking, you know, do we need to ban these technologies? And then the quote I'll point out in the lower right there is that the reason that these technologies are sometimes seen as discriminatory is because of this clustering effect, that they cluster individuals by these demographic groups, whether that's race, ethnicity, gender, et cetera. So we wanted to sort of ask the question, you know, what does that mean for commercial face recognition to cluster people by those demographic categories? And if you go to the next slide, we can kind of see visually what that looks like. A little context here, when people come to the Maryland Test Facility, we take a picture of their iris patterns and a picture of their face.
And we ask ourselves, have we ever seen this person before? And that's 'cause we wanna have really good ground truth information about who the people that are involved in our testing are because the biometric error rates that we evaluate these systems on are sort of based on that ground truth information. And what we found is that sometimes, the computer systems we use to do this, incorrectly think that person has been there before, and when that happens, they'll send us back a picture of who they think it is. And so what you see on the left here are people who experienced this false match behavior, that's what that's called, with their iris pattern.
And what you see on the right is people that had that false match error occur with their face pattern. And you should notice something that's rather profound, it's that all of the iris recognition false matches are not sort of related demographically, right? They're not the same gender, the same race, or the same age, but the same can't be said for the face recognition false matches. Every single person you see sort of on the right-hand side there, more or less, they're all the same gender, they're mostly the same race, and they're more or less the same age as well. And that's a characteristic sort of unique to facial recognition. It doesn't happen with iris recognition and it doesn't happen with fingerprint recognition. And that's something we sort of observed while we were watching these systems operate in real time with face recognition specifically.
Next slide. So most people watching this, that's probably not surprising that face recognition does that. If you were, you know, a computer scientist or someone that was evaluating a face recognition algorithm and I showed you the last slide, you'd probably think that's a working face recognition algorithm, that's what it's supposed to do. And I'll challenge you to sort of understand that I think the reason most people think that is because unlike iris recognition, humans do face recognition as well, right? We have brains that have evolved to do face recognition. It's important for a number of reasons where, evolutionarily speaking, we recognize mates, friends, foes, and we study this in neuroscience and human cognition so much to the point that we actually know the part of the brain that does human face recognition. It's the part you see sort of highlighted in the red here.
And so to us, it's intuitive that a computer algorithm would also think people that share, you know, gender and race are more similar, but we think that sort of gives us sort of an unconscious bias when it comes to humans evaluating, you know, how well face recognition algorithms work. We think they should work like that so when they do, it's not surprising to us. And it's sort of our claim here that we need to overcome that human intuition so that we really can objectively evaluate these technologies.
Next slide. Okay, so this is kind of mathematically what that clustering looks like. I showed you just digitally what it looks like, but the chart you see here in the middle, you can sort of picture every row and every column is a different person, and the value in the cell is how similar a face recognition algorithm thought those two people were. So a couple of things you should notice looking at this sort of matrix here is that the diagonal is all very dark and that's because face recognition algorithms think two of the exact same images look very similar, right? Again, working. And the other thing you should notice is sort of this block diagonal pattern that moves along the sort of middle of the chart.
And that's because, again, face recognition algorithms tend to think Black females look more similar to other Black females than white females do to Black females, which is sort of these two blocks you see along the bottom row there. And that same effect exists for other demographic groups as well. The face recognition algorithms tend to think white females look more similar to white females, Black males look more similar to Black males, white males, et cetera. It's not limited to just one of those groups.
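One way to put a number on the block diagonal pattern described here is to average the similarity scores for pairs of different people, grouped by the demographic labels of the two people. The Python sketch below is a minimal illustration with a toy matrix; it assumes one image per person, a symmetric score matrix, and hypothetical group labels, and it is not the method used in the analysis being presented.

```python
from itertools import combinations
from statistics import mean


def group_pair_means(scores, groups):
    """Mean similarity between different people, keyed by the pair of group labels.

    scores is a square, symmetric matrix (list of lists) where scores[i][j] is how
    similar the algorithm thinks person i and person j are; groups[i] is the
    demographic label of person i. Only pairs with i < j are used, so the diagonal
    (an image compared with itself) is excluded.
    """
    buckets = {}
    for i, j in combinations(range(len(groups)), 2):
        key = tuple(sorted((groups[i], groups[j])))
        buckets.setdefault(key, []).append(scores[i][j])
    return {key: mean(values) for key, values in buckets.items()}


# Toy 4x4 matrix with made-up scores: two people in group_a, two in group_b.
groups = ["group_a", "group_a", "group_b", "group_b"]
scores = [
    [1.0, 0.6, 0.2, 0.1],
    [0.6, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.5],
    [0.1, 0.2, 0.5, 1.0],
]
pair_means = group_pair_means(scores, groups)
```

A block diagonal pattern shows up as within-group means, such as `("group_a", "group_a")`, sitting well above the cross-group mean `("group_a", "group_b")`. If the means were roughly equal, the matrix would look more like the iris case, with no demographic structure in the scores.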
Next slide. So that sort of block diagonal pattern there that exists can be problematic, right? It has this effect that I just sort of outlined. And so we wanted to ask ourselves, do we think it's possible to make a face recognition algorithm that doesn't do this, that behaves more like a fingerprint or an iris recognition algorithm where if you took my fingerprints or my iris patterns and you searched me against a whole gallery of people, the person that comes back most similar to me is not going to be, in all likelihood, another 30-year-old Caucasian male. And so we asked ourselves, do we think it's possible to train a face recognition algorithm to do something similar to that, where it's just as likely to confuse me for, you know, perhaps a (indistinct) Asian woman as it is a 30-year-old Caucasian male, and it turns out we think the answer to that is yes. So that would be moving from this sort of matrix pattern you see in the middle to the pattern you see on the upper right there, where there really is no discernible pattern along the middle. I'm not gonna go into sort of all of the math behind how we did this.
It's kind of a lot for this short kind of presentation. I will say we published a paper. It's actually a DHS Technical Papers Series.
It's on the Biometric and Identity Technology Center website right now, where all that's laid out. I encourage you, if anyone's interested, to go download that and go through it. I'm also happy to take questions on it in the Q&A. But it essentially comes to these conclusions that are laid out on the slide here that we can prevent face recognition algorithms from taking into account race and gender information when they're making identity determinations and it'll still be a functioning face recognition algorithm. It'll still distinguish people from themselves and from other people. There's also a conclusion in the paper that this technique could lead to slightly less accurate face recognition algorithms overall, but algorithms that lead to more fair outcomes, which we sort of point out is a trade space that's worth exploring and a conversation that's worth having.
So yeah, I encourage everyone to go to the Tech Center website and look at that if you're interested. Next slide. So right, as part of this demographic work that I just outlined that we've been doing, we're also heavily involved in the international standards community. I mentioned this isn't sort of just a problem that's unique to the US.
A lot of other nations are having it as well. Go to the next slide. And that's because a lot of people are sort of starting to use face recognition systems. It's seen a sort of an explosion in use cases over the last couple of years. With that explosion in use cases has come, we think, a sort of an increased public awareness and also some concerns.
This has trickled also into the policymaker space. I've got two US Senate bills listed here on the slide, 3284 and 4084, that are both essentially regulation or restrictions on the use of face recognition specifically. There's similar actions pending both in Australia and in the EU, probably elsewhere that I don't even have listed on this chart.
And I think some of that stems from, you know, researchers don't always talk about this the same way as well, and so part of this international standards effort that DHS has really taken a leadership role in is sort of coming to that standard. Next slide. It started actually a few years ago. This is ISO Technical Report 22116 that DHS had actually assumed the editorship on. It was the first to sort of think about how would we do studies of demographic equitability on sort of the international stage. It defined some terms, sort of looked at different areas in the system where performance variation can exist.
That was actually published almost a full year ago now. And that led into some activities that we're currently undertaking, if you go to the next slide, which is the actual international standard. This is ISO Document 19795-10 and it is how to quantify biometric performance across demographic groups.
It was approved shortly after that TR was released, they sort of looked at it and said this is a worthwhile topic and we've got the starts of what could be an international standard here. We actually just put out the first draft of that. Yevgeniy and I are actually the co-editors of it.
The first draft came out this summer, The final (indistinct) should be approved for publication sometime in the 2023 to 2024 timeline. But that means it's sort of open right now for comments and for new material and things like that. And DHS, as well as some other US government agencies, are really taking that opportunity to add material into the standard, which we think is a really good thing.
(indistinct) through some of this material. I think we only have a couple slides left and then we'll get into the Q&A session. But this is what's within scope of that ISO standard, essentially what the title says, how to estimate and report on performance variation across demographic groups. The document attempts to give some guidance on establishing demographic group membership. That's sometimes challenging, particularly when we talk about things like race categories across international countries.
Guidance on using what's called phenotypic measures, which are more observable as opposed to reported characteristics of individuals. I think that's a good thing. It does (indistinct) did as well, which is continues to define these terms and definitions so when we say things like demographic differential, we're all talking about the same thing. And then it gives some requirements on sort of, again, from a math level, how do you do these tests? What kind of statistical techniques do you use? What kind of formulas and things like that? All of that's sort of currently being iterated on.
Next slide. And then this is sort of the last part of the scope, and I'm bringing this up just because I think some of the people on the call, this may be interesting to you. And if it is, I highly encourage you to reach out to Arun and sort of get involved with this standard.
We're always looking for additional partners to help craft this and to take input. But so outside of the scope and the definitions that I went through on the last slide, I mentioned this phenotypical measures, and then the last two we think are really important, right? So it's how and when do you do demographic testing? I think Yevgeniy laid out a really compelling case study and why it's important to do these fairly often, because as things change on the ground, the results of your demographic equitability study could change as well. And then, okay, so you've decided to do an equitability study, what do you actually need to report out? What do you need to sort of tell people to give them the confidence that these systems are working equally well for all groups of people? That'll also be part of the standard. So next slide. Okay, and I think we're gonna end here and just note that Yevgeniy talked about the 2020 Biometric Technology Rally that we did sort of mid-pandemic.
We actually just executed another one a couple months ago. This was in October. And this one is very similar to the rally Yevgeniy talked about where we're looking at masked and non-masked performance with these acquisition and matching systems.
For the first time, this 2021 rally is also explicitly looking at biometric performance across demographic groups. That hasn't been an explicit goal of ours in the past and part of the reason is because we didn't have sort of some of these ideas standardized about how to do that. So as the standard has progressed, it's allowed us to migrate some of those topics into our scenario testing model as well. Reports on how this went will be coming out shortly and so I encourage everybody to check back with Arun and with the Tech Center and get the results of those studies.
And with that, I think that's my last slide. We can turn it back to Arun or open it up for questions, whatever you wanna do. - Okay, I have a question in. It's a question, I'll read it off. "When you looked at iris matching, did you have any mismatches of people across eye color types? Or were they of the same demographic groups of the same eye color?" Think maybe, John, you wanna take this one? - Yeah, I can take this one. It's a really good question actually.
So eye color, we mentioned age, race, and gender as demographic groups that presumably would affect face recognition. Eye color is one we didn't bring up on this slide, but that's one that might affect, you could reasonably presume it would affect iris recognition. I don't actually have an answer to the question because we didn't pull that particular piece of information, but I will just kind of mention that (indistinct) these iris recognition algorithms work is one, they're all looking at irises in what's called the near-IR range, so outside of actual visible light, and so they sort of look like black and white images to begin with. And then the second point is sort of the way that iris matching works uses these patterns called Gabor wavelets. They're not really consistent between eye color so we wouldn't expect to find that there, but it's, again, a good question and something that you could reasonably assume might happen with iris recognition.
- Yeah, I'll add one thing, John, to this reply, is that, you know, the choice of demographic groups is really important and it's something that I think the standard also, this -10 standard that John mentioned, it will address. But, you know, which demographic groups should we worry about assessing? Because, you know, ultimately you could imagine creating a demographic group of people for whom the technology works better than for others, and that could be a demographic group in its own right. But we have these protected demographic groups in different jurisdictions and those are really the ones that we've been focusing on so far. But it's a great question because different technologies may have different demographic (indistinct). And I've got the next question here. And this question is, "What are the confidence intervals on identification relative to the individual features used in the facial recognition algorithm for different demographics?" And I think what the question is trying to ask is, is there a difference in the confidence scores maybe of the identifications based on different demographic groups, or the variability of the measured identification performance across demographic groups? The charts that we had on the screen today, we did not put error bars on those charts.
So our sample is typically numbering in the, I say for folks identifying as Black or African American, we had a sample, you know, in the hundreds, and for folks that self-identified as white as well. So those confidence intervals will typically be on the order of a few percent. And so if that addresses the question, or if not, I apologize and maybe I misunderstood it. - Yeah, maybe just to add on here, and again, please rephrase the question if that wasn't it, per Yevgeniy there, but the general question about how do you put confidence intervals on these numbers I think is actually a very important one, right? So we usually report out things like false match rate, false non-match rate, and asking the question of how similar these things need to be to sort of be determined as equal is a really challenging one in some situations. And it's one of the things that also the standard attempts to go into to sort of give some guidance on, you know, okay, so you have two different numbers from two different rates.
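To make the "few percent" figure concrete, here is a minimal sketch of how a confidence interval on an error rate, and a simple comparison of two groups' rates, could be computed. The Wilson score interval and the pooled two-proportion z-test are standard textbook choices picked for illustration; they are not confirmed as the method used in the rally analysis or in the draft standard, and the counts below (9 and 30 errors out of 300 attempts) are hypothetical.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def wilson_interval(errors, trials, z=Z_95):
    """Wilson score interval for an observed error rate errors/trials."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = errors / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def two_proportion_p_value(errors_a, trials_a, errors_b, trials_b):
    """Two-sided p-value of a pooled z-test that two error rates are equal."""
    pooled = (errors_a + errors_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if se == 0.0:
        return 1.0  # both groups have identical all-or-nothing outcomes
    z = (errors_a / trials_a - errors_b / trials_b) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: two groups of 300 subjects each.
low, high = wilson_interval(9, 300)
print(f"group A rate 3.0%, 95% interval {low:.3f} to {high:.3f}")
print(f"p-value, A vs B: {two_proportion_p_value(9, 300, 30, 300):.4f}")
```

With samples in the hundreds the interval spans a few percentage points, which is why two observed rates that differ by a point or two cannot automatically be called different.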
You know, can you say it's operating at statistically the same rate across those groups? Really good question, really hard question, actually, to answer too. - So a clarification that came in with the question is, the clarification is the features used, iris, ears, nose, mouth, do the confidences associated with different facial features vary differently for different demographics? And I think to answer that question, I would have to say that we simply don't know. These face recognition algorithms are essentially black boxes, at least the modern ones. They take the entire face image and they perform complex convolutional operations on this image so that we can't tell whether or not a particular specific feature is driving that score. But if you look at our technical paper, we're actually trying to unravel at least what kind of features these algorithms might be using that are related to demographics. So we can't pin it down to like, oh, it's the nose or it's the ears, but we can say that we have evidence that they're using some sort of a set of features that are linked with demographics because we see some patterns in the way that this algorithm performs across specific demographic groups.
So, unfortunately I can't answer the question directly. John or Arun, do you have anything? - Yeah, I was just kind of, so I had to drop off and come back on, but I think, you know, one of the points to make here too is most of these modern technologies, we've seen this massive improvement with facial recognition algorithms in the last couple of years, largely driven by the adoption of AI/ML technologies into this space. But with the adoption of the AI/ML, we don't exactly know what features these models are using all the time. And we can try to go back and dissect that a little bit but we're limited because these are trained models that are commercial and proprietary, right? So there's only so much insight we can actually get into what's going on within the models themselves. So it's really hard to pinpoint what are the features let alone whether the features that are more salient vary between different demographic groups. - Okay, next question, "Regarding the system that performed the best on the unmasked faces, is it possible that this system was using some ocular inputs?" And I think the answer is yes.
I think it certainly, again, as Arun pointed out, these are sort of black box systems and we don't know exactly what features they're using, but it's absolutely possible that the system may have been using what we call periocular information, sort of information around the eye region. But it's unlikely that they're doing something that is akin to iris recognition just because the irises are such small portions of a face recognition type image. But yeah, so the answer is yes, it's probably using some periocular information. - And I'll just kind of add onto it.
It's probably not using iris information. As John mentioned, iris is in the near infrared, right? And the features that are kind of discernible on that domain are very different than what would appear in the visible domain. So, you know, it's almost certainly using information here where the algorithms are saying, when there's a mask, maybe I'm weighting these features differently than I would be if the person's not wearing a mask at all. - Yeah, I think it was interesting that, you know, if you were gonna do a study of a black box algorithm, as Arun pointed out, to figure out what features it was using, you know, what you would do is essentially start masking different features out and running recognition performance and seeing when, you know, masking out the nose caused the scores to go down.
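The masking study described here can be sketched in a few lines. Everything in this example is an assumption made for illustration: `toy_black_box_score` is a hypothetical stand-in for a proprietary matcher (a weighted cosine similarity with weights the experimenter cannot see), and the region names and feature values are made up. The point is only the procedure: occlude one region at a time, rescore, and compare against the unoccluded baseline.

```python
import math

REGIONS = ["eyes", "nose", "mouth"]  # hypothetical face regions


def toy_black_box_score(probe, reference):
    """Hypothetical stand-in for a proprietary matcher: weighted cosine similarity."""
    hidden_weights = {"eyes": 3.0, "nose": 1.0, "mouth": 0.5}  # unknown to the tester
    dot = norm_p = norm_r = 0.0
    for region in REGIONS:
        w = hidden_weights[region]
        for a, b in zip(probe[region], reference[region]):
            dot += w * a * b
            norm_p += w * a * a
            norm_r += w * b * b
    if norm_p == 0.0 or norm_r == 0.0:
        return 0.0
    return dot / math.sqrt(norm_p * norm_r)


def occlude(face, region):
    """Return a copy of the face with one region zeroed out, like a mask."""
    masked = dict(face)
    masked[region] = [0.0] * len(face[region])
    return masked


def occlusion_drops(score_fn, probe, reference):
    """Score drop caused by occluding each region of the probe in turn."""
    baseline = score_fn(probe, reference)
    return {r: baseline - score_fn(occlude(probe, r), reference) for r in REGIONS}


# Made-up feature values for two images of the same person.
reference = {"eyes": [0.9, 0.1], "nose": [0.4, 0.6], "mouth": [0.2, 0.8]}
probe = {"eyes": [0.85, 0.15], "nose": [0.45, 0.55], "mouth": [0.25, 0.75]}

drops = occlusion_drops(toy_black_box_score, probe, reference)
for region, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"occluding {region:5s} lowers the score by {drop:.3f}")
```

The region whose removal costs the most score is the one the matcher leans on most, and that is learned without ever looking inside the model.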
With the COVID pandemic and the application of masks, we actually had this really interesting opportunity to sort of do a natural experiment there and say can face recognition systems still work when we've removed facial information from the lower part of the face? And, you know, I think the results that Yevgeniy presented sort of led us to the conclusion that it can. And so it must be using, again, not iris-like features but ocular-like features, I think that's a very good assumption. - So we have more questions. Here's the next one.
"Has there been or is there planned any research on genealogy diversity versus performance? In other words, do some demographics have greater genealogy diversity and therefore a greater difference in facial features than others that underpin some of the performance differences?" So I'll take a part of that and I know that John will wanna weigh in. So I talked about false non-match rates in my portion of the talk. And we believe that a lot of the errors and failures to match can be tied to some of the acquisition components of the systems.
So for instance, I showed you that the predominance of the errors made in this scenario test wasn't actually due to matching, was primarily due to image acquisition. And we think that at least one component that is responsible for that is the quality with which cameras can image different shades of a human's skin. And there's a number of pieces of research out there showing that, you know, that really can affect, you could be overexposed or underexposed, depending on exactly how that camera is set up and we think that's really important. So there, I think, it's this acquisition part and the way it interacts with the skin phenotype that's important, but when it comes to that other kind of error, false matches, that's where I'll pass it over to John 'cause I'm sure he'll have some thoughts.
- Yeah, this is an absolutely fantastic question. There is almost certainly something happening here that is more deep than simple race self-reporting. The simple answer to your question is there has been some research on this but not really at the genealogy level.
So we know, for example, that twins who share underlying DNA give face recognition a problem, right? That face recognition algorithms and humans will think that twin A and twin B look very, very similar. We also know that taking it sort of one step further removed from genealogical identity, parents and siblings also share facial features, so there's a genealogy link that comes across in facial recognition. We don't know, and it's really important in my mind, an area of research, where that sort of genealogical breakdown stops, where people stop looking similar because they happen to share genealogy.
And so the planned part of your question is it's sorely needed. We don't have anything right now to sort of work on this. Although there has been a little bit of work in doing, like, reconstructions from DNA to face, but we haven't done any of that to date. But we need to. It's a really good area to look into. - Yeah, one thing I'll add here is that I think the question asks could there be greater diversity of genealogy in some groups than others? And I think, you know, as a neuroscientist, you know, we all needed to recognize conspecifics, as John mentioned, we need to tell friend from foe, we need to tell apart the individuals in our groups and individuals outside our groups. I don't think that those evolutionary pressures have ever been different.
So it's not clear, you know, and we can all achieve these tasks well regardless, you know, what our origins are. So I think that really these are, the questions that we raised on the false match side with face recognition is that, you know, people that are similar genetically to each other, like twins, are gonna be similar in their face characteristics. But I don't know of any evidence of differences in diversity for specific groups.
So we do have other questions. "The science here is potentially evolving past the public debate, and will these findings be published? And if so, where? They could be helpful." So I think John mentioned that we have publications available on the Biometric and Identity Technology Center website. And a lot of this research is also published in academic journals as well. So I believe you could go to the BI-TC website today and download this technical paper series that John briefed, so that's available. And the components of the demographically disaggregated analysis that I briefed earlier are available in brief format but have not been explicitly published.
For the 2021 rally, it's one of our goals to brief and to make this demographically disaggregated analysis of commercial technology available as well. Arun, do you wanna weigh in here? - Yeah, I'll just point out that we have, so from our Biometric and Identity Technology Center page, you can also find a link to our MDTF page, mdtf.org, and we have a number of papers there that are linked as well. You know, there's this constant balance we're trying to strike about putting out good content and material and analyses here, and also going through like, peer-reviewed processes, right? Peer review is not always a timely process.
It can take a lot of time and sometimes it's just because the editors are busy, right? So we are trying to make sure we are putting out information on a timely basis, that it's informative and useful, but yeah, it is also beneficial to get a peer review but sometimes, and that's what we've almost exclusively done in the past. It's just that that process was just taking so long and it was preventing us from helping to get content out into some of the public forum, 'cause otherwise people send out misinformation via tweets in seconds. It's very hard for us to go through a peer review process and then put something out to contest that when our processes are so much longer. So we do go through the internal, so we do go through S&T processes to review it before we publish it. But anyway, so we're trying to do a combination of both so we can make sure better information is available to people who need this information to help inform public policy and public debate on these topics.
I think we have a couple more questions here. - [Yevgeniy] Yep, so the next one here is, oops- - I sent it away, sorry. It's on the published, I'll- - Okay. Go ahead, Arun, please.
- "The other race effect has been in literature for a long time. Wouldn't a way of mitigating equitability be to have a more diverse training set?" - So I'll start with this one. What we showed, what John showed on his slides about face recognition is not the other race effect. The other race effect, it says that if I am raised in an environment where I'm exposed to people of a certain demographic group, that I am better at discriminating faces that belong to that same demographic group, and usually it's my own demographic group.
What John was talking about is that face recognition in general has a greater propensity to confuse people that match in race, gender, and age to each other, which is a very different thing. And it's something that took a while to wrap our heads around because of this cognitive bias that we have that says, hey, faces are more similar to us perceptually. But that's because we have this neural circuitry that tells us so.
We don't have this type of neural circuitry or intuition for iris recognition or fingerprint recognition. And in fact, these systems don't make those same kind of demographic, the demographic confusion matrix of those systems looks very different. John, do you wanna add to that? - Yeah, so this is actually a really clean case of where a diverse training set certainly helps but it doesn't solve this problem. So we hear a lot from people that are in the face recognition space that, oh, if we only had a more diverse dataset, a lot of these problems would disappear. And here I think is a great sort of counter example. So imagine you're sort of training a face recognition system and you set the optimization parameters, or you are telling the computer program what doing a good job looks like.
And in most ways that these things are trained right now, it has two objectives. It needs to think pictures of me look like other pictures of me, they get high scores, and pictures of me and other people get low scores. And you sort of say, okay, computer program, neural net, accomplish these two things. And if you accomplish these two things, I have a working face recognition system. What this actually says is there's a third criterion so even getting a more diverse dataset wouldn't solve this problem.
You need to add this third optimization parameter and it's that the person who looks most like me shouldn't also share my demographic group, shouldn't be a 30-year-old Caucasian male. So it's a great example of where, yeah, a diverse training set would help but you'd then also need to make this recognition that you have to update your optimization steps and add this new thing. Short answer, yes, it would help, no, it doesn't solve this particular problem.
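One way to see whether a matcher meets the third criterion described above is to measure how often each person's highest-scoring non-mated comparison lands on someone from the same demographic group, and compare that with what chance alone would give. The sketch below is a diagnostic under stated assumptions, not a training procedure and not the analysis from the technical paper series: the embeddings, the dot-product score, the group labels "A" and "B", and the function names are all hypothetical.

```python
def dot(a, b):
    """Similarity score between two embeddings (hypothetical: plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def same_group_top_impostor_rate(people):
    """Fraction of people whose best-scoring impostor shares their group.

    people: list of (person_id, group, embedding), one entry per identity.
    """
    hits = 0
    for pid, group, emb in people:
        impostors = [(dot(emb, e), g) for (p, g, e) in people if p != pid]
        _, top_group = max(impostors, key=lambda t: t[0])
        hits += top_group == group
    return hits / len(people)


def chance_rate(people):
    """Expected same-group rate if the top impostor were picked at random."""
    total = 0.0
    for pid, group, _ in people:
        others = [g for (p, g, _) in people if p != pid]
        total += others.count(group) / len(others)
    return total / len(people)


# Toy gallery: embeddings that happen to cluster by (made-up) group label.
people = [
    ("p1", "A", [1.0, 0.1]),
    ("p2", "A", [0.9, 0.2]),
    ("p3", "B", [0.1, 1.0]),
    ("p4", "B", [0.2, 0.9]),
]

observed = same_group_top_impostor_rate(people)
expected = chance_rate(people)
print(f"top impostor shares group: {observed:.2f} observed vs {expected:.2f} by chance")
```

An observed rate well above the chance rate is the pattern John describes: the matcher is still accurate on the first two objectives, yet its nearest false-match candidates cluster within demographic groups.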
- [Yevgeniy] Yep. And we have one more. - Oh, I'm sorry. - Go ahead, Arun. - Did you guys go over, like, some of the reasons why we're looking into this in particular, the whole thing about protected classes and trying to have equitable performance across...
Yep. Okay, nevermind. - Yeah, so we didn't go through it in detail and I think it makes sense to do that now. Yeah, Arun, so protected groups came up a little bit earlier and I wanted to point out a ramification and I don't think we put that slide into this brief. But there's a ramification that people need to consider about face recognition making these more confusions between people of the same race, gender, and age. And these happen because let's say I have a gallery of people who share my demographic characteristics.
Everybody is, let's say, a white male of a certain age. And if I am not in that gallery but other people that look like me are, and I go to match against that gallery, then, in general, if you have an accurate system, my likelihood of a match will be low but that low likelihood of a match may still be higher against that gallery of white males than, say, an Asian female against that same gallery, right? So I would have a greater hazard of matching falsely to this gallery of white males than somebody else that doesn't share my demographics. And in many circumstances, that could be considered, you know, not desirable or unfair, because if that gallery is something that, you know, perhaps I don't wanna match to, maybe it's a list of people that would not be allowed to fly that day, then I would have a greater sort of hazard of having that error occur. And with face recognition, that could be something that we need to worry about. So I'll go through the last question here, we're running up on our time, and that last question is, "Have infrared systems been used in face recognition? And are those systems less prone to demographic effects?" And I think the answer is yes, and we've actually, in the rally, have had some systems that have used infrared light for their acquisition. We did see some different demographic characteristics but it's a very small sample.
Most face recognition systems today that have participated in our testing have been visible spectrum face acquisition systems using typical RGB sensors. But it is an open question of what would happen to these acquisition demographic effects if infrared systems were used. So that takes us through the end of the questions.
Arun, I'll hand it off to you. - Yeah, thank you so much. So Yevgeniy and John, thank you so much for doing the call, for sharing the webinar, and thanks to all the participants who joined us this afternoon to learn more about some of our research and learn some of the work that we're doing. If you have any questions, please feel free to reach out to me within the Biometric and Identity Technology Center. You can always get me on Teams or email. In fact, I think I just saw a couple of emails come through, so I'm happy to help follow up and help answer questions and share any information as it's relevant.
So, yeah, thanks again and please, please keep in touch.