Countries Don’t Have to Build Their Own AI—Just Their Place in It
Some countries worry that the popularity of generative AI tools trained on predominantly English-language, Western-centric content might diminish their cultures. But instead of building their own AI alternatives trained on local content, these countries would be better served by ensuring their cultures and languages are digitally represented and available for training AI systems.
While today’s AI models are capable of supporting a wide range of cultural and linguistic outputs, developers have trained them using predominantly English-language content, particularly U.S.-based content, which has historically dominated the online data ecosystem. What’s more, research suggests large language models “think” in English, even when translating into other languages, which results in a loss of language-specific nuance.
This dominance isn’t malicious; it is a product of economic geography. The United States is the global leader in AI development and the largest AI market. The pattern is not new, but it risks flattening cultural diversity. The rise of mass media in the era of globalisation exposed audiences to new cultural norms and created others, but it also eroded local traditions, languages, and cultures as popular content sought to appeal to the majority. Just as Hollywood became the global default for film, American firms have become the default AI storytellers.
Crucially, expecting technology firms—whether American, Chinese, or other global players—to fix this imbalance alone is unrealistic. While these companies have significant commercial incentives to diversify AI, it is not solely their responsibility to create globally representative models. Indeed, firms like Google and Meta have made substantial contributions to advancing multilingual AI.
However, the data divide persists. Despite this progress, many of the world’s languages and cultures, especially minority ones, remain underrepresented in the digital space.
Governments and communities around the world should take advantage of the commercial value of diverse data to ensure their languages, histories, and customs are part of the digital landscape. Consider Japan’s approach: rather than relying on external actors, it has made targeted efforts to digitise its cultural heritage and linguistic assets, ensuring they can feed both domestic and global technology initiatives that serve Japan’s AI needs. Empowerment starts with participation.
Regulation such as the EU’s AI Act won’t close representation gaps, and, while well-intentioned, such efforts risk widening them if they suppress widely used tools without offering viable, inclusive alternatives. The problem is not that these models exist but that too many communities are excluded from shaping them. Minority and Indigenous communities don’t want to be shielded from AI; they want to be represented in it.
In New Zealand, for instance, Te Hiku Media, a charitable media organisation focused on Māori language revitalisation, developed a Māori speech recognition model that not only preserves the language but also sets ethical standards for how AI can empower, rather than erase, marginalised cultures. Its work shows that the path forward lies not in regulation that constrains innovation but in participation that expands inclusion.
The most effective fix is increasing data availability, and governments should lead. If a culture, language, or community isn’t accessible online in digitised form, it won’t exist in AI. Governments should invest in initiatives that collect, structure, and license data from underrepresented communities to ensure their inclusion. Wikipedia, for instance, has partnered with language activists to expand its content in endangered languages, not only preserving cultural heritage but also ensuring it becomes part of the next generation of AI models.
These efforts go further when government-backed, for example through public-private partnerships like the Endangered Languages Project (ELP). First developed by Google, the ELP is now stewarded by the First Peoples’ Cultural Council, a government corporation in British Columbia that guides the project to ensure these communities remain protected as their languages and heritage are digitised.
When globally representative data is readily available, AI systems have a better chance of becoming more inclusive and globally relevant. The dominance of Anglo-American training data in AI isn’t a problem to be regulated away; it is a gap to be filled. Governments have a duty to protect the cultural heritage of their diverse communities.
By prioritising the digitisation and availability of data that reflects this diversity, countries and communities stand a better chance of shaping AI in their own image, rather than submitting to someone else’s.