How Yesterday’s Web-Crawling Policies Will Shape Tomorrow’s AI Leadership

January 5, 2026

Should AI models be allowed to train on personal information that is publicly available on the Internet? How countries answer this question will have significant implications for their global leadership in artificial intelligence.

Developers train modern generative AI models on a mix of data sources, including publicly available data (such as websites and government records), licensed data (news archives, academic journals), user-provided data (chat prompts and customer feedback), proprietary data (product documentation and support logs), and synthetic data (AI-generated text, images, and code).

Much of the content on the public Internet contains personally identifiable information. Social media profiles may reveal a person’s age or birthday. Government records may disclose where someone lives or their political affiliation. Personal websites, professional profiles, and news articles often contain information about people’s education, employment history, and online activity.

Because anyone with an Internet connection can access this information, it has long been understood to raise only limited privacy concerns. This approach underpins how search engines have operated for decades: They crawl, index, store, and make third-party websites searchable, even when those pages contain personal data.

The European Union’s data-protection regime has complicated this longstanding model. Under the General Data Protection Regulation (GDPR), companies must have a lawful basis for processing personal data, even when that data is publicly accessible on the Internet, and they assume extensive obligations as data controllers. In 2023, for example, Italy’s data-protection authority fined a company €60,000 for creating an online telephone directory compiled from publicly available websites, concluding that the firm lacked a lawful basis for processing the personal information.

European regulators have permitted search engines to continue crawling the web, but the GDPR’s “right to be forgotten” allows individuals to request removal of links to otherwise lawful, publicly available information. In effect, regulators have concluded that information may be public and yet still require concealment from search results in order to protect privacy.

The United States takes a markedly different approach. Developers generally face no legal restrictions on crawling publicly accessible websites. Even states with the strongest privacy laws, such as California, do not impose a right to be forgotten on search engines. Instead, they focus on the information that businesses collect directly from consumers, not on information gathered from the open Internet.

These diverging frameworks take on new importance in the AI era. As companies increasingly rely on large-scale web crawling to gather training data, regulators must decide what obligations to impose on those collecting such information and what rights, if any, individuals should have over its use.

In the EU, this question turns on the GDPR’s “legitimate interest” test. Regulators must determine whether training AI models on scraped web data satisfies a lawful purpose, whether the data collection is necessary for that purpose, and whether the interests of the data subjects override those of developers. Each question involves substantial discretion and creates legal uncertainty.

Regulators could, for example, conclude that large-scale web crawling is unnecessary because developers could purchase alternative datasets, or that individuals should have broad authority to block any use of their personal data. They might further decide that individuals can demand that data scraped from public websites be deleted or corrected, even though it is often technically impossible to “untrain” a model on specific data, placing developers in an untenable position.

Some of these tensions can be mitigated through technical and contractual tools. For example, website owners have long used the robots.txt protocol to specify whether and how automated systems may crawl their content. That same mechanism now allows websites to indicate whether AI developers may access their data. In addition, the Internet Engineering Task Force is developing new standards to give site owners even more granular control over how AI systems interact with their content. Some publishers have instead adopted pay-to-crawl models, requiring AI companies to pay a fee to access their sites. While these mechanisms were not designed specifically to govern personal data, they give content owners meaningful influence over how information they publish is reused.
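To make the robots.txt mechanism concrete, here is a minimal sketch using Python's standard-library parser. The site, the crawler name "ExampleAIBot", and the policy file are all hypothetical; real AI crawlers use their own published user-agent strings, and real sites serve the policy at /robots.txt.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that bars one AI crawler ("ExampleAIBot")
# from the entire site while leaving it open to all other crawlers.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named AI crawler is disallowed everywhere on the site...
print(parser.can_fetch("ExampleAIBot", "https://example.com/profile/jane"))  # False

# ...while any other crawler may proceed.
print(parser.can_fetch("OtherBot", "https://example.com/profile/jane"))  # True
```

Note that robots.txt is advisory: it signals the site owner's wishes, and compliance depends on crawlers voluntarily checking directives like these before fetching a page.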

AI developers have also implemented their own safeguards. Many companies deliberately avoid scraping websites that primarily host personal information, recognizing that such data provides little value for training general-purpose language and reasoning models.

Past debates over the right to be forgotten illustrate how fundamentally different European and American conceptions of public Internet data have become. In the AI era, those differences will determine how easy or difficult it is to develop new models, and therefore directly influence which regions foster AI innovation at scale.

The Internet may be forever, but regulatory frameworks should not be. Decisions made today about web crawling will help determine where the next generation of AI leadership emerges—whether in Europe, the United States, or elsewhere. If EU policymakers want their data protection regime to support both privacy and competitiveness, they should clearly and unambiguously affirm that developers may crawl the public Internet for AI training, even when that data includes personal information. A modern data governance framework should protect individuals without undermining the foundational data practices on which AI innovation depends. The alternative is a regulatory approach that preserves yesterday’s assumptions while sacrificing tomorrow’s technological leadership.
