Setting the Record Straight: De-Identification Does Work
When implemented properly, de-identification can enable the benefits of data analytics without threats to privacy.
In the coming years, analytics will offer an enormous opportunity to generate economic and social value from data. But much of the success of data analytics will depend on the ability to ensure that individuals’ privacy is respected. One of the most effective ways to do this is through strong “de-identification” of the data: in essence, storing and sharing the data without revealing the identity of the individuals involved.
A number of researchers have been investigating techniques to re-identify de-identified datasets. Unfortunately, some commentators have misconstrued their findings to suggest that de-identification is ineffective. Contrary to what misleading headlines and pronouncements in the media regularly suggest, datasets containing personal information may be de-identified in a manner that minimizes the risk of re-identification, often while maintaining a high level of data quality.
Despite previous efforts to dispel it, the myth that datasets cannot be reliably de-identified, no matter what methods are employed, continues to be promulgated. It is increasingly apparent that one of the reasons for the staying power of this myth is not factual inaccuracy or error within the primary literature, but rather a tendency on the part of commentators on that literature to overstate its findings. While nothing is perfect, the risk of re-identification of individuals from properly de-identified data is significantly lower than indicated by commentators on the primary literature.
At the same time, advancements in data analytics are unlocking opportunities to use de-identified datasets in ways never before possible. Where appropriate safeguards exist, the evidence-based insights and innovations made possible through such analysis create substantial social and economic benefits. However, the continued lack of trust in de-identification and the focus on re-identification risks may make data custodians less inclined to provide researchers with access to much-needed information, even if it has been strongly de-identified. Worse, custodians may come to believe that they should not waste their time even attempting to de-identify personal information before making it available for secondary research purposes. Either outcome could have a highly negative impact on the availability of de-identified information for potentially beneficial secondary uses.