Note to Press: Facial Analysis Is Not Facial Recognition

Daniel Castro January 27, 2019


According to multiple news sources, including The New York Times, Amazon is peddling racially biased facial recognition software to the unsuspecting public. But these allegations are more fiction than fact: they confuse and conflate two similarly named but otherwise very different technologies, facial recognition and facial analysis. A closer look at the details of the story shows that the evidence does not support the headlines.

To begin, a quick recap. Last week, the MIT Media Lab published a report evaluating the performance of several commercial gender-classification tools, that is, software that predicts whether a face in a photo is male or female. The lead researcher, Joy Buolamwini, had previously published research showing that commercial gender-classification systems performed significantly worse on darker-skinned female faces than on lighter-skinned male faces. The new study measured how much some of those commercial systems, such as the ones from Microsoft and IBM, had improved since the earlier research, and compared the results with the performance of two systems not included in the initial study: one from Amazon and one from another company, Kairos.

It is a fair critique that some facial analysis systems have performed less accurately on darker-skinned and female faces. But the study is still misleading: the way it counts errors overstates the magnitude of the problem, because its results ignore the confidence level associated with each prediction. In a binary classification tool such as this one, which has only two possible outputs, most systems return not just a prediction but also a confidence score for that prediction. For example, Amazon writes in its documentation:

“Each attribute or emotion has a value and a confidence score. For example, a certain face might be found as ‘Male’ with a confidence score of 90% or having a ‘Smile’ with a confidence score of 85%. We recommend using a threshold of 99% or more for use cases where the accuracy of classification could have any negative impact on the subjects of the images.” 

Similarly, Kairos emphasizes this point repeatedly in describing its system, writing:

“During the enrolment phase, the subject's gender is analyzed. The API does this by comparing the photo in question with thousands of other photos. From this, it determines the likelihood of the subject being of a particular gender at a stated confidence level.… Again, the result returned to the developer is couched in terms of a percentage confidence that each face is a particular gender.”

Yet the MIT Media Lab study ignores confidence levels. It treats all gender predictions the same, making no distinction between results with low confidence scores and those with high ones. As the study itself acknowledges, ignoring confidence levels significantly skews the results. IBM replicated the first study and found an error rate of only 3.46 percent, compared with the 16.97 percent reported in the initial study.
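To illustrate why ignoring confidence scores matters, here is a minimal Python sketch, assuming each prediction is available as a hypothetical record of true label, predicted label, and confidence. It computes the error rate over every prediction, as a score-blind evaluation would, and then over only the predictions that clear a 99 percent threshold, the level Amazon's documentation recommends where a misclassification could harm the people pictured. The data are invented toy values, not results from the MIT study or any vendor, and the names `Prediction`, `error_rate`, and `thresholded_report` are illustrative assumptions rather than any product's API.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Prediction:
    true_label: str    # ground-truth label, e.g. "female"
    predicted: str     # label the classifier returned
    confidence: float  # classifier's confidence in `predicted`, from 0.0 to 1.0


def error_rate(preds: Sequence[Prediction]) -> float:
    """Fraction of predictions whose label is wrong (0.0 for an empty list)."""
    if not preds:
        return 0.0
    wrong = sum(1 for p in preds if p.predicted != p.true_label)
    return wrong / len(preds)


def thresholded_report(preds: Sequence[Prediction], threshold: float) -> Tuple[float, float]:
    """Score only predictions at or above `threshold`.

    Returns (error rate among kept predictions, fraction of predictions kept).
    Predictions below the threshold are treated as abstentions, not errors.
    """
    kept = [p for p in preds if p.confidence >= threshold]
    coverage = len(kept) / len(preds) if preds else 0.0
    return error_rate(kept), coverage


# Invented toy data for illustration only; not taken from any study or product.
sample = [
    Prediction("female", "female", 0.995),
    Prediction("female", "male", 0.62),    # low-confidence error
    Prediction("female", "female", 0.991),
    Prediction("male", "male", 0.998),
    Prediction("female", "male", 0.55),    # low-confidence error
    Prediction("male", "male", 0.97),
    Prediction("female", "female", 0.999),
    Prediction("male", "female", 0.993),   # high-confidence error
]

if __name__ == "__main__":
    print(error_rate(sample))                # 0.375: every prediction counted
    print(thresholded_report(sample, 0.99))  # (0.2, 0.625): 5 of 8 kept, 1 wrong
```

Thresholding is a trade-off: predictions below the cutoff are treated as abstentions rather than confident calls, so the lower error rate comes at the cost of coverage, and both numbers are needed to describe how the system behaves.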

But the more serious problems lie in the media's reporting on this study. For example, The New York Times writes:

“Over the last two years, Amazon has aggressively marketed its facial recognition technology to police departments and federal agencies as a service to help law enforcement identify suspects more quickly.… Now a new study from researchers at the M.I.T. Media Lab has found that Amazon’s system, Rekognition, had much more difficulty in telling the gender of female faces and of darker-skinned faces in photos than similar services from IBM and Microsoft. The results raise questions about potential bias that could hamper Amazon’s drive to popularize the technology.”

First, like virtually all coverage of this study, this article conflates “facial analysis,” in which a computer system predicts features such as age, gender, or emotion from a photo, with “facial recognition,” in which a system matches similar faces, either by searching a database for similar images (one-to-many) or by confirming whether two images match (one-to-one). They may sound similar, but they are as different as apple trees and applesauce. As Dr. Matt Wood, general manager for AI at Amazon Web Services, explains, “Facial analysis and facial recognition are completely different in terms of the underlying technology and the data used to train them.” The study itself makes virtually no mention of facial recognition, yet facial recognition is the primary focus of the New York Times article.
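The difference can be made concrete with a short Python sketch. Facial analysis, as Amazon's documentation describes it, returns attributes with confidence scores for a single face and never compares that face with anyone else's. Facial recognition instead compares numerical representations of faces. The sketch assumes each face has already been converted into a fixed-length embedding vector by some model (not shown) and that a match means cosine similarity at or above a hypothetical 0.8 threshold; the function names, the threshold, and the toy vectors are illustrative assumptions, not any vendor's API.

```python
import math
from typing import Dict, List, Sequence

# A face embedding: a fixed-length vector that a (hypothetical, not shown)
# recognition model would compute from a face image.
Embedding = Sequence[float]


def cosine_similarity(a: Embedding, b: Embedding) -> float:
    """Cosine of the angle between two embeddings (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def verify(probe: Embedding, reference: Embedding, threshold: float = 0.8) -> bool:
    """One-to-one matching: do two faces appear to belong to the same person?"""
    return cosine_similarity(probe, reference) >= threshold


def identify(probe: Embedding, gallery: Dict[str, Embedding],
             threshold: float = 0.8, top_k: int = 3) -> List[str]:
    """One-to-many matching: IDs of gallery faces that clear the threshold, best first."""
    scored = [(cosine_similarity(probe, emb), person_id)
              for person_id, emb in gallery.items()]
    matches = [(score, pid) for score, pid in scored if score >= threshold]
    matches.sort(key=lambda item: item[0], reverse=True)
    return [pid for _, pid in matches[:top_k]]


if __name__ == "__main__":
    # Toy three-dimensional embeddings, invented for illustration.
    gallery = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0], "carol": [0.9, 0.1, 0.0]}
    probe = [1.0, 0.05, 0.0]
    print(verify(probe, gallery["alice"]))  # True
    print(verify(probe, gallery["bob"]))    # False
    print(identify(probe, gallery))         # ['alice', 'carol']
```

In real systems the embeddings come from a trained model and the threshold is tuned to trade false matches against missed matches; evaluating that matching pipeline is a different task from checking whether a predicted gender label is correct.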

Second, articles like this one confuse the past with the present. Although the study was published only last week, the authors ran their tests in August 2018, and the versions of the software they tested are not the ones currently on the market. Amazon announced in November 2018 that it had made major updates to its facial analysis software, as facial analysis and recognition companies do regularly, and Kairos announced an update to its system in October 2018. The New York Times article acknowledges these facts, but they are overshadowed by a headline that ignores the nuance, instead blaring, “Amazon Is Pushing Facial Technology That a Study Says Could Be Biased.”

Third, these articles suggest that the study has direct implications for law enforcement, even though it offers no evidence that gender-classification tools have any substantial use in law enforcement. Long-standing privacy hawks, such as Sen. Ed Markey (D-MA), even took to Twitter to repeat this unsupported link, writing, “Adoption of flawed facial recognition technologies by law enforcement could literally hardwire racial and gender bias into police depts.” This claim is bizarre and unconvincing, because police do not use facial analysis software to determine the gender of the people they encounter.

Unfortunately, stories like this one discourage the adoption of new technologies and distract from legitimate policy solutions. For example, there are real opportunities to improve the accuracy of facial recognition and facial analysis technologies across demographic groups, such as having the National Institute of Standards and Technology (NIST) develop a representative dataset of facial images and expand its testing of facial recognition and facial analysis systems to include cloud-based solutions.

While disheartening, the reporting on this study is not unusual. New technologies often follow a “privacy panic cycle,” in which public fears about a technology outpace understanding of its benefits, and these fears are often driven by some in the media amplifying misleading and hyperbolic claims that the facts do not support. Indeed, ITIF has chronicled how the tone of media coverage of technology in general has declined.

The one glimmer of hope is that, when it comes to facial recognition, the public generally seems to support the technology when its benefits are clear, such as increasing public safety, improving device security, and expediting airport security lines.