
Broadening Genetic Representation in Biomedical Research Data
The United States leads the world in biopharmaceutical innovation, providing its citizens with speedy access to novel drugs, more than half of which are launched first in the United States before becoming available in other countries. Yet ensuring that these therapies benefit the entire U.S. population requires a strong commitment to biomedical research that represents the country’s full demographic makeup.
Efforts to Enhance Diversity in Biomedical Research
Historically, genomic research has overwhelmingly focused on individuals of European descent, creating significant blind spots in drug safety and efficacy for underrepresented groups. When genetic findings are not tested across diverse ethnic populations, treatments that work well for some may be less effective—or even harmful—for others. As of 2018, genome-wide association studies (large-scale genetic studies that analyze variations across individuals to identify disease-linked genetic markers) were heavily skewed, with 78 percent of participants being of European descent. Similarly, 88 percent of participants in the UK Biobank, the world’s largest whole-genome database, are white. This lack of representation extends to clinical trials as well. In 2020, the Food and Drug Administration (FDA) reported that 75 percent of clinical trial participants were white, with Hispanic, Black, and Asian individuals making up just 11 percent, 8 percent, and 6 percent of participants, respectively.
This underrepresentation in biomedical databases and clinical research can have serious public health consequences. To address these disparities, the National Institutes of Health (NIH) launched the All of Us program in 2018, aiming to build the world’s most diverse genetic dataset. This initiative is a crucial step toward identifying genetic markers that influence disease risk and treatment response across different U.S. demographics. Today, roughly half of the All of Us data comes from individuals of non-European descent, filling critical gaps in research and improving the ability of scientists to develop treatments that work across different populations.
The Consequences of Underrepresentation in Research
Underrepresentation in biomedical research has real-world implications for public health. Some diseases disproportionately affect certain racial and ethnic groups, and genetic differences can influence how individuals metabolize medications, impacting both safety and efficacy. For example, sickle cell anemia—an inherited blood disorder that primarily affects Black individuals—remains significantly understudied compared to other hematologic conditions, despite being the most common inherited blood disorder in the United States. Similarly, common medications like albuterol, a widely used asthma inhaler, are less effective in Black children due to genetic differences that alter treatment response. For years, this disparity went undetected because 95 percent of lung disease studies were conducted exclusively on individuals of European descent. Today, Black children in the United States face an asthma mortality rate that is 2.5 times higher than that of white children—an alarming statistic that highlights how research gaps can exacerbate health disparities.
Diversifying biomedical research can lead to better public health outcomes. A study on type 2 diabetes underscores this point. Researchers analyzed data from over 2.5 million individuals, 40 percent of whom were of non-European descent, incorporating data from NIH’s All of Us. Their findings identified 611 genetic markers influencing diabetes progression, 145 of which had never been documented before. Such discoveries hold immense potential for improving diabetes treatment by guiding more effective, personalized care tailored to different demographic groups.
AI in Drug Development and the Importance of Diverse Training Data
As AI becomes increasingly integral to all phases of drug development—from discovery and clinical trials to manufacturing and supply chains—using training data that reflects the rich demographic composition of the United States is essential to ensuring that new therapies are safe and effective for the entire population.
AI-enabled drug development relies on large clinical and genetic datasets to predict drug safety and efficacy. However, AI models are only as good as the data they are trained on. If training data lacks diversity and representation, these models can reinforce and amplify existing biases, exacerbating health disparities. For example, AI models trained primarily on data from individuals of European ancestry may mispredict drug safety and efficacy for patients of African, Hispanic, or Asian descent, leading to less accurate treatment recommendations and increasing the risk of prescribing ineffective or even harmful therapies.
To ensure that novel AI-enabled therapies benefit the entire U.S. population, it is crucial to build diverse and representative training data and to promote clinical trial diversity so that AI models learn from data that truly reflects the country’s population. Without proactive efforts to expand medical datasets and clinical trials, the United States risks widening health disparities and losing its competitive edge in global biopharmaceutical innovation, which benefits a wide range of people worldwide. Strengthening policies that promote diversity in biomedical research can not only improve public health but also reinforce the nation’s leadership in the industry.