---
title: "Canada’s Privacy Ruling on AI Training Data Sets a Bad Precedent"
summary: |-
  Canada’s privacy regulators are restricting the use of public online data for AI training, but this approach could undermine AI innovation. Canada should instead adopt a harm-based framework focused on concrete privacy risks.
date: "2026-05-12"
issues: ["Artificial Intelligence", "Privacy"]
authors: ["Daniel Castro"]
content_type: "Blogs"
canonical_url: "https://itif.org/publications/2026/05/12/canadas-privacy-ruling-on-ai-training-data-sets-a-bad-precedent/"
---

# Canada’s Privacy Ruling on AI Training Data Sets a Bad Precedent

Canada’s privacy regulators are taking a misguided approach to AI training data. In a [recent decision](https://www.priv.gc.ca/en/opc-actions-and-decisions/investigations/investigations-into-businesses/2026/pipeda-2026-002-overview/), the federal Office of the Privacy Commissioner (OPC) and several provincial authorities concluded that OpenAI violated Canadian privacy law, in part by using publicly accessible Internet data and licensed third-party datasets to train ChatGPT.

The OPC acknowledged that OpenAI’s broader purpose—developing and deploying generative AI systems—was appropriate. It also recognized that user interaction data could legitimately be used to improve model performance. But it concluded that OpenAI’s use of publicly accessible online information was “overbroad” and failed to satisfy Canadian consent requirements because individuals would not have reasonably expected their public data to be used to train AI systems. The British Columbia and Alberta regulators went further, finding that the consent problem remained unresolved regardless of OpenAI’s mitigation measures.

That conclusion reflects a flawed understanding of how modern AI systems are developed and risks placing Canada on the wrong side of global AI competition.

Large language models (LLMs) depend on [access to large-scale datasets](https://commoncrawl.org/overview) to learn how language, reasoning, and information retrieval work. Publicly accessible websites, discussion forums, academic content, and licensed datasets are foundational inputs for training these systems. Restricting access to those materials would not only constrain a single company but also undermine the development of advanced AI systems across the broader ecosystem, including by startups, researchers, and open-source developers.

The OPC places significant weight on the claim that people did not reasonably expect publicly available information to be used for AI training because the practice was “[novel](https://www.priv.gc.ca/en/opc-actions-and-decisions/investigations/investigations-into-businesses/2026/pipeda-2026-002-overview/)” and “[not widely understood](https://www.priv.gc.ca/en/opc-actions-and-decisions/investigations/investigations-into-businesses/2026/pipeda-2026-002-overview/)” at the time. But novelty alone is not a sound basis for restricting technological development. Indeed, limiting businesses to using data only in ways consumers already understand would, almost by definition, constrain innovation and the emergence of new technologies. Most transformative technologies, including search engines, translation systems, spam filters, and recommendation algorithms, initially used data in ways consumers did not fully anticipate. Over time, societies adapted because the public benefits were substantial and manageable safeguards could be implemented.

The decision also highlights a problematic feature of Canadian privacy law: the narrow regulatory distinction between “publicly accessible” and “publicly available” information. Canadian privacy law defines when consent exceptions apply to [publicly available data](https://laws-lois.justice.gc.ca/eng/regulations/SOR-2001-7/page-1.html), but those definitions predate modern AI and were not designed with Internet-scale training datasets in mind. The OPC applied them as written, but the result is that AI developers collecting data from a diverse range of online sources cannot rely on those exceptions and must instead obtain express consent in many cases.

That standard is unworkable. It is not feasible to obtain express consent from billions of individuals whose publicly viewable information may appear somewhere in Internet-scale training datasets. Requiring that level of consent would effectively prohibit the development of frontier AI systems in Canada while doing little to meaningfully improve privacy protections.

More importantly, the regulators largely overlook the marginal privacy risk created by AI training on [publicly accessible information](https://itif.org/publications/2026/03/13/how-rules-for-publicly-available-data-are-shaping-the-future-of-ai/). If information is already publicly available online, the relevant policy question is not whether an AI model processed that information during training, but whether doing so materially increases the risk of concrete harm to individuals.

Much of the information at issue was already intentionally shared in public forums, social media posts, websites, and online discussions. Someone publicly expressing political views online, for example, necessarily understands that others may access and read that information. The mere fact that an AI system may learn linguistic or statistical patterns from publicly available text does not inherently create a new category of privacy harm.

The regulators place particular emphasis on the possibility that training datasets could include sensitive or inaccurate information. But if that information is already publicly accessible on the Internet, then the relevant question is whether AI training meaningfully changes the level of exposure or harm. A search engine can already index that content. The regulators never clearly establish why AI training should be treated under a fundamentally different standard simply because the processing occurs through an LLM rather than another Internet-based system.

Nor does the OPC meaningfully engage with the mitigation measures OpenAI implemented to reduce any incremental risks. According to the findings, the company developed filtering tools to detect and mask personal information in publicly accessible Internet data and licensed datasets before training models, reduced the use of sensitive information in fine-tuning, and introduced mechanisms to support correction and deletion requests. Yet despite acknowledging that these measures significantly reduced residual risks, the regulator still characterized the earlier use of public Internet data as inherently problematic.
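
To make the kind of mitigation at issue concrete, the minimal sketch below shows what pre-training PII masking can look like: replacing likely personal identifiers with placeholder tokens before text enters a training corpus. It is purely illustrative; the `mask_pii` function and its patterns are assumptions for this example, and the findings do not describe OpenAI’s actual tooling at this level of technical detail.

```python
import re

# Illustrative patterns only; production PII filters layer many detectors
# (named-entity models, checksum validation, allow/deny lists), not just regexes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_pii(text: str) -> str:
    """Replace likely personal identifiers with placeholder tokens
    before the text is added to a training corpus."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

if __name__ == "__main__":
    sample = "Contact Jane at jane.doe@example.com or +1 (555) 123-4567."
    print(mask_pii(sample))
    # -> Contact Jane at [EMAIL] or [PHONE].
```

The point of the sketch is the pipeline stage: masking happens during corpus preparation, before any model sees the text, which is why such measures bear directly on the residual risk the regulators acknowledged was reduced.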

The OPC also revealed a deeper problem in its reasoning: It concluded that concerns about reasonable expectations were reduced because public awareness of AI systems had increased since the launch of ChatGPT. In other words, the practice became more acceptable once consumers became familiar with it.

If the same underlying practice becomes lawful primarily because the public is now accustomed to it, then the issue is less about concrete privacy harms and more about the OPC reacting to the novelty of the technology itself. Indeed, the decision effectively allows later AI developers to benefit from the public awareness created by early innovators while penalizing the companies that introduced the technology in the first place.

This creates unnecessary regulatory uncertainty, particularly for novel technologies. Companies developing new products cannot reliably predict whether regulators will later determine that consumers did not sufficiently expect or understand a new use of publicly accessible data. That approach risks turning data protection enforcement into a moving target shaped less by measurable harms than by evolving public sentiment toward emerging technologies. At that point, Canadian enforcement starts to look less like risk mitigation and more like regulation based on vibes.

Canada should focus on preventing concrete harms, not creating regulatory uncertainty around the use of publicly available information that is foundational to modern AI development.
