In the Wake of Generative AI, Industry-Led Standards for Data Scraping Are a Must

September 1, 2023

Shortly after the release of OpenAI’s popular generative AI system, ChatGPT, some website owners began complaining about AI companies scraping, or automatically gathering, data from the public Internet to train their AI systems. While courts have repeatedly reaffirmed that web scraping is legal in the United States, significant public concerns about AI have raised the risk that Congress might step in and pass anti-scraping legislation. However, a legislative intervention would be a mistake, especially given that the private sector has previously resolved a similar issue through voluntary measures.

Nearly 30 years ago, similar complaints arose over the use of web crawlers, automated programs that index the content of webpages. Internet search engines widely deployed these bots to systematically browse the Internet and keep their databases of webpages up to date. Website owners, however, complained about their use, as the bots can create unwanted traffic, increase server and network loads, and degrade user experiences. In response, Internet engineers created the Robots Exclusion Protocol in 1994: a voluntary, community-developed standard that lets a website tell a web crawler which parts of the site it may crawl. Since then, website owners and web crawlers have widely adopted this standard, balancing website owners’ concerns against search engines’ requirements without the need for regulatory action.
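The protocol works through a plain-text file named robots.txt placed at the root of a website. As a minimal sketch, assuming a site that wants to keep one directory (here the hypothetical /private/) off-limits while leaving the rest open, the file might read:

    # robots.txt, served at the site root (e.g., https://example.com/robots.txt)
    # The rules below apply to every crawler that honors the protocol
    User-agent: *
    Disallow: /private/
    Allow: /

Compliance is voluntary: the file expresses the site owner’s preferences, and well-behaved crawlers check it before fetching pages. The IETF eventually formalized the protocol as RFC 9309 in 2022.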

Today’s concerns about data scraped from the public Internet to train AI systems echo those earlier complaints from website owners about search engine web crawlers. Just as search engine companies need to scrape data to provide accurate and up-to-date search results, so too do AI companies need to scrape data to train their AI systems. For example, OpenAI has explained that scraping data from websites “can help AI models become more accurate and improve their general capabilities and safety.”

Web scraping is legal in the United States, but there is a risk that policymakers could decide to intervene. Indeed, top data protection regulators from a dozen countries—including Australia, Canada, Mexico, China, and the UK—recently published an open letter to website operators urging them to “implement measures to protect against unlawful data scraping.” But new laws and regulations are not necessary given that the private sector is already taking steps to give website operators more control over whether AI web crawlers scrape their sites.

First, many websites can use the existing Robots Exclusion Protocol to restrict web crawlers from popular AI companies. OpenAI, for example, has published details on its crawler, GPTBot, so that website owners can easily disallow it from accessing their sites. Almost 20 percent of the top 1,000 websites in the world have already blocked AI crawlers using this method, which shows how easily it can be done. Second, the private sector is exploring additional technical standards that would give website owners and content producers more control. For example, Adobe has proposed letting creators attach “Do Not Train” metadata to their work to signal to AI companies that it should not be added to AI training datasets. Google has similarly stated that the Internet community should collaborate with the AI community on machine-readable standards that give website owners more control, and the company has announced its intent to lead a public discussion on the topic. As these initiatives gain momentum, organizations such as the Internet Engineering Task Force will likely provide a forum for finalizing a standard.
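Opting out of AI crawlers relies on the same robots.txt mechanism. As a sketch: the GPTBot token below is the user agent OpenAI documents for its crawler, and CCBot is the crawler of Common Crawl, whose corpus is widely used in AI training; tokens for any other bots would have to come from each operator’s own documentation:

    # Keep OpenAI's documented crawler off the entire site
    User-agent: GPTBot
    Disallow: /

    # Keep Common Crawl's bot off as well
    User-agent: CCBot
    Disallow: /

Because each crawler announces its own user-agent token, a site can opt out of AI training crawls while still allowing ordinary search engine indexing, precisely the kind of granular control the protocol was built to provide.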

Given the success of non-government solutions to concerns about web crawling, policymakers should not intervene at this stage. The AI industry is evolving rapidly. New laws or regulations restricting how organizations and individuals can scrape publicly available Internet data to train AI models would blunt progress, impede the industry’s ability to adapt to new developments and challenges, and hurt U.S. competitiveness in AI.
