OpenAI in the storm: a study accuses the company of training its models on copyright-protected data


A new study published by the AI Disclosures Project raises crucial questions about the transparency and ethics of OpenAI's training methods. At the center of the investigation is GPT-4o, the company's latest language model, which the study found able to recognize non-public, copyright-protected content from the publisher O'Reilly Media, a strong indication that such material appeared in its training data.

Research to increase transparency in the race for artificial intelligence

The study was conducted as part of the AI Disclosures Project, an initiative led by technologist Tim O’Reilly and economist Ilan Strauss. The objective of the project is clear: to promote greater responsibility in the artificial intelligence sector, emphasizing the need for transparent practices by companies developing LLM (Large Language Models).

The document, a working paper, analyzes how the current lack of formal disclosure obligations for the data used to train these models can create serious issues of trust and legality. The researchers argue that, just as disclosure rules emerged in finance to foster robust markets, similar regulation is needed in the field of AI to prevent abuse and systemic harm.

Testing the models on copyright-protected content: the data speaks clearly

To conduct their investigation, the researchers used a legally obtained dataset of 34 copyright-protected books published by O'Reilly Media. Through a technique known as the DE-COP membership inference attack, they measured whether language models could distinguish verbatim human-written text from paraphrased versions generated by an LLM. The results point the finger at GPT-4o.
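DE-COP, roughly, turns membership inference into a multiple-choice quiz: the model is shown a verbatim passage from a book shuffled among LLM-written paraphrases and asked which one is the original. A model that picks the verbatim passage well above chance has likely seen it during training. Below is a minimal sketch of the quiz construction and scoring only; querying an actual model is out of scope here, and the function names are illustrative, not taken from the paper.

```python
import random

def build_decop_quiz(original: str, paraphrases: list[str],
                     seed: int = 0) -> tuple[str, int]:
    """Build a DE-COP-style multiple-choice question: the verbatim
    passage is shuffled among paraphrases. A model trained on the
    book should pick the verbatim option more often than chance."""
    options = [original] + paraphrases
    rng = random.Random(seed)          # fixed seed for reproducibility
    rng.shuffle(options)
    answer = options.index(original)   # index of the verbatim passage
    prompt = "Which passage appeared verbatim in the book?\n"
    for i, opt in enumerate(options):
        prompt += f"{chr(65 + i)}. {opt}\n"   # A., B., C., ...
    return prompt, answer

def guess_rate(guesses: list[int], answers: list[int]) -> float:
    """Fraction of quizzes where the model picked the verbatim option.
    With 4 options, chance level is 0.25; rates well above that
    suggest the passage was part of the training data."""
    hits = sum(g == a for g, a in zip(guesses, answers))
    return hits / len(answers)
```

In the actual attack, each guess would come from prompting the model under test; here the scoring function simply compares recorded guesses against the known answer keys.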

The AUROC scores (a metric that measures how well two classes can be separated: 50% is chance, 100% is perfect) speak clearly:

  • GPT-4o achieved a score of 82% in the ability to recognize non-public content sourced from O’Reilly Media — a value that indicates a strong likelihood that those contents were included in the training data.
  • For comparison, the GPT-3.5 Turbo model showed a much weaker recognition (about 50%), while GPT-4o Mini, a lighter version of the model, did not recognize either public or non-public content.
  • The data also shows that GPT-4o recognizes non-publicly accessible materials better (82% AUROC) compared to freely available O’Reilly content (64% AUROC).
  • Conversely, GPT-3.5 Turbo shows greater familiarity with public content (64%) than with reserved content (54%).
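The AUROC figures above have a simple probabilistic reading: the area under the ROC curve equals the chance that a randomly drawn "member" example (text in the training data) receives a higher attack score than a randomly drawn "non-member" one, so 50% means guessing and 82% means strong separation. A minimal sketch of that pairwise computation follows; it is illustrative, not the study's code.

```python
def auroc(member_scores: list[float], nonmember_scores: list[float]) -> float:
    """Area under the ROC curve, computed as the probability that a
    randomly chosen member example scores higher than a randomly
    chosen non-member example. 0.5 = indistinguishable, 1.0 = perfect."""
    pairs = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                pairs += 1.0
            elif m == n:
                pairs += 0.5  # ties count as half a win
    return pairs / (len(member_scores) * len(nonmember_scores))
```

For example, `auroc([0.9, 0.8], [0.1, 0.2])` returns 1.0 (perfect separation), while identical score distributions return 0.5, matching GPT-3.5 Turbo's near-chance result on non-public content.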

A possible source of access to content covered by copyright, researchers suggest, could be LibGen, a well-known online platform that hosts unauthorized digital copies of books protected by copyright. All the books analyzed in the study are indeed present in that virtual library.

The AI industry between legality, economy, and content sustainability

The results raise a broader question: the systematic use of copyright-protected data to train language models could undermine the economic sustainability of professionals who create original content. If companies are not compensated for the use of their publications, the entire information ecosystem — including publishing — risks becoming impoverished.

According to the report, the unauthorized use of proprietary data without compensation contributes to the reduction of quality and diversity of online content. And it’s not just an ethical issue but also an economic one: without revenue, publishers and professional creators cannot continue to produce valuable content.

The researchers' concern: the models know more than they declare

Another aspect highlighted by the study is how much better the most recent language models have become at understanding and reproducing the subtle differences between human language and AI-generated language. The researchers also note a possible "temporal bias", since language evolves over time. To neutralize this distortion, the tests were run on two models (GPT-4o and GPT-4o Mini) trained during the same time period.

Despite the evidence being limited to a specific case — OpenAI and O’Reilly texts — the authors believe that the phenomenon could be widespread and systemic in the generative artificial intelligence sector.

Towards a Legal Market for Training Data?

The study concludes with a broader reflection: it is necessary to build a system in which AI developers can legally access training data, through licensing agreements and transparent compensation for content creators.

Some companies are already developing business models in this direction. This is the case of Defined.ai, a platform that sells training data with the consent of its authors and with all personally identifiable information removed. A regulated industry could represent a legal and sustainable alternative to the opaque practices of some large AI companies today.

The role of politics: Europe can lead the way

The report also offers a regulatory insight: the entry into force of the European Union's new AI Act, which provides for disclosure obligations around model training, could trigger a virtuous cycle. If the rules are well specified and truly enforced, rights holders will finally be able to know when and how their works are being used.

This would be a fundamental step toward the creation of legal markets where creators' content is effectively recognized as an economic good used by AIs.

Not only OpenAI: a study that points the finger at an increasingly frequent practice

Through a detailed analysis of 34 books from the publisher O'Reilly Media, the researchers provided empirical evidence suggesting that OpenAI likely trained its GPT-4o model on non-public, copyright-protected data.

If confirmed, it would be a potentially serious violation not only of copyright rules but also of the principles of transparency, consent, and equity that large tech companies should adhere to in an era where artificial intelligence increasingly impacts our society and our economy.
