Building Responsible AI for Voice Developer Platforms - Joel Susal | AI Summit New York 2021
(bright music) - Hi everyone. I'm Joel Susal. I'm the Head of Product for Rev.ai.
I wanna give just some brief background on me, and then I wanna tell you about Rev, the company I work for, and how we fit into this equation. So for some brief background on me, I've spent the last 10 years at the companies you see here on the screen. I led Mobile Product at Dolby, and then went on to be the founder and general manager of Dolby's virtual and augmented reality business. I led Enterprise Product at a background check company called Checkr, and for the last year, I have been the Head of Product at Rev.ai, the developer platform from Rev. So I wanted to talk to you about Rev, introduce what we do, how we fit into the AI ecosystem, and what we've found.
We've uncovered some meaningful findings in the realm of mitigating bias in AI. I'm not claiming that we've solved the problem, but we have made some meaningful, positive progress, and I wanna share our findings with the industry, so that we all can build a more responsible, reliable, and scalable future together. I will point out that I am a white male, and I will be speaking about bias, and white males have not been disproportionately disenfranchised by AI like some other groups have. I'm here today to share some information about our approach, to share some data about how our approach moves us forward, but these are first steps. Our progress is exciting, but there's more to do.
I beseech you. We as an industry need to collaborate and commit together to building a sustainable, scalable, and fair solution in our respective industries. So, Rev is a company that started over 10 years ago, with a mission to create great work-from-home jobs. It's kind of an interesting mission, partially because the beneficiary of that mission is the worker. It's not the shareholder, it's not the customer, and it also gave us a lot of license to try and build a number of different businesses, which we did, but where we've landed and invested heavily in the last 10 years is in the transcription space.
We have our product, Rev.com, that attracts 60,000 freelancers, we call them Revvers, who do excellent transcription, captioning, and subtitling for our customers. Part of making a great work-from-home job means really cultivating a sense of rewarding work for these Revvers. We have created ways for Revvers to level up and earn more money, in exchange for doing good work, and so, as a result, we are a leader in the space. Our customers are happy. We have a healthy revenue stream, but there's something even more exciting that is a by-product of our success in the human transcription space, and that is, I'd argue, the best data set in the world for training an AI-based ASR, automatic speech recognition, solution. It is a by-product of us earning money, and so we've amassed tremendous hours of high quality data.
It's audio in, expert transcription out. We have millions of hours, and while I'm speaking to you today, we are adding even more hours to that dataset, and as a result, our Rev.ai solution is, we believe, the most accurate in the world. One thing I will say is that, of course, we also use this engine to make our Revvers even more efficient, and so there's this circular exchange of value happening between AI and human, that happens underneath the hood of everything we do at Rev, and more on that in a second. So welcome to the AI Summit. Welcome to the responsible AI track, and we're here to discuss a couple of high-level questions.
One, why AI needs to be front of mind for business leaders like yourselves, and two, what to do to accelerate the AI journey, and the essential considerations that, frankly, need to be addressed in order to do that in a safe and fair way. Well, I have some thoughts and some experiences on both these topics. So let's dive right in.
So why does AI need to be front of mind for business leaders? Well, according to Daugherty and Wilson, in their book, "Human + Machine", AI represents the third wave of business transformation, as it relates to human empowerment. They posit that wave one was characterized by standardized processes, and things like the assembly line that were commercialized at scale by the likes of Henry Ford. The second wave being more automated processes, leveraging technology to build process and workflow automation, and I would characterize this as the nineties, and it's when companies like UPS and Walmart proliferated because of their back office optimization and excellence. So now, we are in the third wave, which is characterized by adaptive processes where real-time data is used, instead of an a priori set of steps. It's adaptable. Where the first two waves were characterized largely by machines replacing humans, this third wave is described as mutually complementary interactions between humans and machines.
The third wave doesn't replace humans, rather, it puts them to work in new and different ways. Daugherty and Wilson call this the "missing middle". If you see on the left, we have human activities. On the right, we have machine-only activities.
What's new, what's distinct about this wave is that there is this middle where machines and humans are working seamlessly together to empower both sides of that equation. Machines perform the repetitive tasks, while humans train, improve, and place guardrails around the machines. So this third wave could represent an entirely new industrial revolution with orders of magnitude improvements in efficiency, cost and scale. It represents tremendous opportunity.
I know you believe that, because you're here, but it also introduces a new set of challenges, and those challenges come as we place more, much more decision-making power in the hands of machines who, ironically, have no hands. So one of the biggest emotional challenges is related to trust. Trust that the machine is going to do the thing you expect it to do. AI is at a disadvantage here, and some of that mistrust is emotional, and some of it is justified. On the emotional side, it turns out humans lose trust in machines more quickly than they do in humans.
There's a term called "algorithm aversion", and if you've ever clicked zero or screamed "operator" into the phone, even if it means waiting an hour, then you have exhibited algorithm aversion, but I'm sure none of the technology elites in this room have ever resorted to Luddite practices like that. Second, is that humans are more sympathetic to mistakes made by other humans. Gill Pratt, the Chief Executive of the Toyota Research Institute, told lawmakers on Capitol Hill in 2017, that people are more inclined to forgive mistakes that humans make, than mistakes made by machines.
This is also, I called it an emotional reason, but it's also justified. I'll tell you why. It's understandable. Humans make different mistakes than machines do.
We know this. We experience it, because we operate a human based transcription service, as well as an AI based transcription service. While both of these services can be measured using something called "word error rate", and in fact, that is how they are measured, humans tend to omit words in places that listeners or consumers find acceptable. Times when there are interruptions or there's crosstalk, where, perhaps, in our brains, we have to do our own filtering. Whereas machines might make a mistake somewhere that is less confounding, and therefore, sticks out more, but both of those mistakes show up as just one mistake in the calculation of word error rate, and so our yardstick is a little bit biased, but the main point is that humans and machines do make different mistakes.
So we have to build trust. We have to be able to rely on these machines to make rational decisions all the time, and all the time is a very tall order, and one of the biggest reasons for that is the notion of bias, which I'll explain now. So first off, I'll start by saying that every human exhibits bias. To each his or her own, but we all have them. If you look at the picture to the left, and envision a call center with thousands of agents, each exhibiting his or her own biases, what you have is actually, a very well diversified, high-variance output, as it relates to bias on any specific input. It's expensive and complex, but it is diverse.
Now, if you train an algorithm to replace, or augment, these call center agents, which is depicted on the right, what you get is a scalable, cost-effective solution. But if you're not careful, it won't have that same glorious diversity that you had with thousands of independently biased agents. Whatever bias is present in that algorithm will be applied, at scale, to all of its inputs. So there are three ways where bias gets introduced into AI systems.
Number one, in the input training data set itself, and if it lacks diversity, then you're introducing bias. Number two, the ground truth that is derived from the data, and if that process lacks diversity, then you're introducing bias, and the third is the developer building the product, and that developer's inability to analyze performance of their system on anything other than the metrics that they have in their original training set. So there's a litany of problems. Let me go through them one by one.
Training data, I'll tell you from experience, training data is hard to come by. It has to be large in volume. It needs to be high quality, and you know, the old adage, "Garbage in, garbage out"? Well, not surprisingly, in an effort to ship product quickly, we often fixate on how much data is the minimum needed to ship an MVP, and only later, and only sometimes, do we iterate to make improvements. Data is hard to find, and speaking from experience, when you find a large cache of usable data, you tend to want to use it. The problem is that this data is not always representative and as diverse as the ultimate intended use case.
Looking to build a French speech recognition model for use in Quebec, I'll tell you it's pretty hard to find lots of high quality training data. If you do, it may come from only a few sources like television broadcasts, earnings calls, podcasts, university lectures, which often have limited representation in terms of demographics, socioeconomic backgrounds, and other traits. To be clear, these all represent vocal outputs from industries that are historically dominated by white, upper and middle-class males.
So they are overrepresented, easy to overrepresent in this example. Second step is, assuming the data isn't already transcribed, and by the way, if it was already transcribed, then everyone else in the industry would be using it too, and you wouldn't have any differentiation. So now you have to transcribe it with as much accuracy as possible.
Now is when dialects, accents, and other confounding factors can interact positively or negatively with the biases of whomever you hire to generate the training transcript, and again, bias is introduced. Finally, third step, you have to build a system that's ultimately going to be deployed. You have developers making decisions, and not all of whom are conversant in the intricacies of the French Canadian language, or the difference between Romanian and Latvian languages, or simplified Mandarin, compared to traditional Mandarin.
You get the picture, and yet these developers are tasked with building a system that's suitable for use in a real-world environment. If there was a lack of diversity in the sourcing and training data, step one, you can bet that there's, most likely, a lack of diversity in the test set as well. Because of these challenges, we need to up-level our thinking. We need to not be thinking about "Garbage in, garbage out", rather, we need to now be thinking about "Bias in, bias out".
So the consequences, well, they can be quite damaging. As ASR gets deployed in hiring scenarios, if one of the groups you fall into is underrepresented in some way, whether it's accent, gender, age, your voice's acoustic tone, your speech patterns, and that list goes on, you may not get past that initial AI powered interview screen. Your testimony in court may not be accurately captured.
Your order at the drive-through fast food joint may tend to be wrong more often. As AI becomes more ubiquitous, you will be increasingly misunderstood, which is dehumanizing. A Stanford study found that five leading ASR engines, Rev.ai was not part of this study, perform significantly worse on African-American vernacular English than on white English. They found that the average word error rate, across these five services that they measured, was nearly double for audio samples of black speakers, compared to white speakers.
Another example, an Irish native recently was denied citizenship in Australia after failing an English speaking test, which is clearly a false positive or false negative, depending on if you're a glass half-empty, half-full type of person, but there is some promise. It sounds bleak, but there's promise. I'm not gonna claim that we've solved it, but we've found some very promising techniques for mitigating these problems.
We're fortunate at Rev.ai, to be comparatively, significantly better than other ASR vendors out there. Like I said before, our work is not done. There's a long way to go, but let me tell you what we've observed. We worked with one of our customers, HireVue, to analyze our platform, Rev.ai, against other major ASR providers in this space.
It was an independent analysis where HireVue used 800 audio clips, varying in length between one minute and five minutes, and which were divided, more or less, equally, at least 200 clips each, across different ethnic groups that they labeled White, Black, Hispanic, and Asian. They also divided these samples by age, over the age of 40, under the age of 40, and male versus female.
Our Rev.com business, powered by our diverse group of Revvers, provided the ground truth reference, and our Rev.ai ASR engine was one of the few candidate engines that were tested by our partner, HireVue. The test was initially conducted 14 months ago, but it was just refreshed last week, and the reason they did that is that we have a new, next-generation model that's in beta testing now. We have Rev 1 and Rev 2. Rev 1 represents our production model deployed today. Rev 2 is our model that's in beta, and you'll see both of those data sets in just a moment. We see, and you will see too, that there's really some exciting promise in this new enhanced model.
So let's get into some data. First off, I wanna just say, this is a word error rate plot. I wanna describe what word error rate is. You take the number of insertions, which is adding words to a transcript that should not be there; deletions, which is omitting words from a transcript that should have been there; and substitutions, which is getting a word transcribed incorrectly.
So the incorrect additions, the incorrect omissions, and the incorrect modifications are your errors. If you divide that by the total number of words, you get a word error rate. So I wanna also hit home, because this is an error rate.
A lower score is better. It's kind of the inverse of accuracy. Zero error is 100% accuracy.
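The word error rate calculation described here can be sketched in a few lines of Python. This is a minimal illustration, not Rev's or HireVue's scoring code: it assumes whitespace tokenization with no text normalization, and it divides by the number of words in the reference transcript, which is the conventional denominator. The function name `word_error_rate` is made up for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the minimum number of edits needed to turn the
    # reference words processed so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: a reference word is missing
                curr[j - 1] + 1,     # insertion: an extra word was added
                prev[j - 1] + cost,  # substitution (or a match when cost is 0)
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("fox" -> "dog") plus one insertion ("high") over 5 reference words.
print(word_error_rate("the quick brown fox jumps", "the quick brown dog jumps high"))  # 0.4
```

Because insertions count as errors, the rate can exceed 1.0 when the hypothesis is much longer than the reference, so "100% minus word error rate" is only a rough reading of accuracy.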
So what we see here, is a comparison on gender, male versus female, across Google, Amazon, our production model that was deployed as of 14 months ago, and our new model, Rev 2. What you can see is that in all four of these models, they all perform better on females than on males. Most importantly, this shows a stark improvement. I think what stands out to me is the differential between Rev 1, and Google and Amazon, and the fact that Rev 2 has improved further upon Rev 1. Next, let's look at some samples broken down by ethnic groups. Again, these are samples labeled by our partner, HireVue, across groupings that they labeled Asian, Black, Hispanic, and White.
A few interesting things to note here. First off, all engines performed best on the samples labeled for white speakers, and they all performed worst on the samples labeled as black speakers. The fact that these data points are not all on top of each other, that they're not entirely coincident, that's a visualization of ethnic bias. It exists in all of these engines. What you want to see is them as clustered as possible, as close together as possible.
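One way to put a number on how clustered those points are is to compute word error rate per group and look at the gap between the best and worst group. The sketch below is an assumption about how such a comparison could be scored, not a description of HireVue's method. The group names and counts are made-up placeholders, not results from the study, and `wer_by_group` and `wer_spread` are hypothetical helper names.

```python
from collections import defaultdict


def wer_by_group(clips):
    """clips: iterable of (group_label, error_count, reference_word_count)."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, n_errors, n_ref_words in clips:
        errors[group] += n_errors
        words[group] += n_ref_words
    # Pool errors and words per group, so long clips weigh more than short ones.
    return {group: errors[group] / words[group] for group in words}


def wer_spread(per_group):
    """Gap between the worst and best group; 0.0 means identical error rates."""
    return max(per_group.values()) - min(per_group.values())


# Placeholder counts for illustration only (NOT data from the talk).
clips = [
    ("group_a", 10, 100),
    ("group_a", 20, 100),
    ("group_b", 30, 100),
    ("group_b", 10, 100),
]
per_group = wer_by_group(clips)
print(per_group)
print(wer_spread(per_group))
```

Pooling counts is one design choice; averaging the per-clip rates instead would give every clip equal weight regardless of length, and the two can rank engines differently.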
And again, lower is better. So we're really happy with the results that show that our latest engine places our worst performing ethnic group better than our competitor Google's best performing ethnic group. Again, there's still work to be done, but we're showing some very exciting progress. Finally, I wanna show you some data on accents, based on country of origin. So all of the tests were based on the English language, but this one compares English as spoken by natives of the different countries that you see listed on the X axis.
What you can see here, the green scatter plot, Rev 1.0, our current production model, outperforms Google and Amazon for all countries tested. What's really exciting is that our next generation model is uniformly and significantly better than even that. So let me wrap up with some bold and prescriptive statements. First off, I have a top recommendation, and that is to prioritize diverse training data. One reason for that is that your training data will likely be partitioned into the set that you use for training, and you'll set some aside, so that you can use it for testing, independent testing later, and the fact is, if you have a really diverse training set as an input, you're solving two of the three problems that I mentioned before.
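The partitioning mentioned here can be sketched as a stratified split, which holds out the same fraction of every group so that the test set stays as diverse as the training set. The talk does not say how Rev partitions its data, so treat this as one common approach rather than their method. The function name, the `group_of` callback, and the 20% default are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(samples, group_of, test_fraction=0.2, seed=0):
    """Split samples into (train, test), holding out a share of every group."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_group = defaultdict(list)
    for sample in samples:
        by_group[group_of(sample)].append(sample)

    train, test = [], []
    for members in by_group.values():
        members = members[:]  # copy so the caller's data is not reordered
        rng.shuffle(members)
        # Hold out at least one sample per group so no group goes untested.
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical samples: (accent_label, clip_id) pairs.
samples = [("accent_a", i) for i in range(10)] + [("accent_b", i) for i in range(5)]
train, test = stratified_split(samples, group_of=lambda s: s[0])
print(len(train), len(test))  # 12 3
```

A group with a single sample ends up entirely in the test set under this rule, which is a signal that more data is needed for that group before it can be trained on at all.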
You're solving the diversity in your training set, and you're empowering your developers with the tools and visibility, so they're not flying blind when they're evaluating whether what they're building is fair, equitable, and responsible. Related to that, we have found that generating real world data is far superior to buying, or trying to buy, a diverse data set. By using real world data, your system organically captures new words, accents, and other trends. It's near impossible to write down what an ideally diverse dataset looks like. So having a customer's real data is the most natural and self-correcting way to train an AI model. Next, make sure that your data refinement and ground truth is derived from a diverse and reliable source.
We have found our 60,000 Revvers to be a very strong source of diverse and reliable data. By the way, as of today, you can place orders for both our Revver ecosystem, as well as our AI powered ASR solution from the Rev.ai API. So I'm excited to announce that. Third, test, refine, listen, invest, repeat.
This is the missing middle. AI is here, but it's not a complete solution yet. The missing middle, it is real, and we're in the wave.
We are in it where humans and machines are working together, and they're working to improve one another. At Rev, we've seen the fruits of how this approach can make both better. Remember when I mentioned that our AI powers our Revvers to be more efficient, and our Revvers provide high quality, accurate, diverse transcripts to train our AI? That is the missing middle, and it has proven to be fruitful for us.
Finally, if you use ASR in your business, and if you care about quality and bias, which you should, and which we do, please come talk to us. We're at booth number 345, right outside. We're happy to talk.
And with that, I would like to thank you. Thank you, and have a great conference. Have a happy new year, and we'll talk to you soon. (audience applauding) (bright music)