
OpenAI Trained AI Models on Copyrighted O'Reilly Media Books, Researchers Claim

The AI Disclosures Project conducted a test to see if OpenAI’s AI models could identify content from paywalled O’Reilly Media books.

Highlights
  • O’Reilly Media is said to not have any licensing agreement with OpenAI
  • GPT-4o was said to show the highest recognition of copyrighted content
  • Researchers used a membership inference attack in the test


OpenAI might have trained its artificial intelligence (AI) models on copyrighted content, according to a research paper. The recently published paper, from the non-profit organisation AI Disclosures Project, reports that the San Francisco-based AI firm's recent large language models (LLMs) showed higher recognition of copyrighted content than its older models. The researchers used a recently developed method called DE-COP to detect copyrighted content in the AI models' training dataset. Notably, the study did not find evidence that GPT-4o mini was trained on the specific copyrighted content.

Researchers Used DE-COP to Test OpenAI's Training Dataset

The study, titled Beyond Public Access in LLM Pre-Training Data, was conducted to check if OpenAI's AI models were trained on non-public book content. For the study, researchers focused on O'Reilly Media, a US online learning platform that hosts numerous copyrighted books. The founder of the platform, Tim O'Reilly, is also one of the co-authors of the study.

The researchers used the DE-COP method to test whether the training data of the AI models contained copyrighted material. This is a relatively new test, introduced in a paper published in 2024. The method, a type of membership inference attack, quizzes an AI model with a multiple-choice test to see whether it can distinguish the original copyrighted text from machine-generated paraphrases of it.
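
To make the quiz idea concrete, here is a minimal Python sketch, not the researchers' code. It assumes each item pairs one verbatim excerpt with a few paraphrases, and that the model is exposed as a hypothetical `choose` function returning the index of the option it believes is the original. The names `build_quiz` and `guess_rate` are illustrative only.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One quiz item: (verbatim excerpt, machine-generated paraphrases of it).
Item = Tuple[str, Sequence[str]]


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> Tuple[List[str], int]:
    """Shuffle the verbatim excerpt in with its paraphrases.

    Returns the shuffled options and the position of the original.
    Assumes no paraphrase is identical to the original text.
    """
    options = [original, *paraphrases]
    rng.shuffle(options)
    return options, options.index(original)


def guess_rate(items: Sequence[Item],
               choose: Callable[[List[str]], int],
               seed: int = 0) -> float:
    """Fraction of quizzes in which `choose` picks the verbatim excerpt.

    `choose` stands in for a call to the model under test (hypothetical).
    """
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in items:
        options, answer = build_quiz(original, paraphrases, rng)
        if choose(options) == answer:
            correct += 1
    return correct / len(items)
```

Under these assumptions, a model that has never seen the text should land near chance (one in four when there are three paraphrases), while a noticeably higher guess rate hints that the excerpt may have been seen during training.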

The researchers used Claude 3.5 Sonnet to paraphrase the copyrighted material. As many as 3,962 paragraph excerpts from 34 O'Reilly Media books were used for the test.

Based on the tests conducted, the researchers claimed to have found that the GPT-4o AI model showed the highest recognition of the copyrighted and paywalled O'Reilly book content, with an 82 percent Area Under the Receiver Operating Characteristic Curve (AUROC) score. Notably, the AUROC score is part of the DE-COP method and is derived from the guess rates in the multiple-choice test.
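
As a rough illustration of how a set of guess rates can be turned into an area-under-the-ROC-curve score, the sketch below computes it as the probability that a randomly chosen score from a suspected-training group is higher than one from a comparison group, with ties counting as half. The grouping, the function name `auroc`, and the numbers are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Sequence


def auroc(suspected: Sequence[float], comparison: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the chance that a random score from `suspected` exceeds a
    random score from `comparison`; ties contribute 0.5.
    """
    wins = 0.0
    for s in suspected:
        for c in comparison:
            if s > c:
                wins += 1.0
            elif s == c:
                wins += 0.5
    return wins / (len(suspected) * len(comparison))


# Made-up guess rates, purely for illustration.
suspected_rates = [0.6, 0.3]
comparison_rates = [0.4, 0.2]
score = auroc(suspected_rates, comparison_rates)  # 3 of 4 pairs favour "suspected"
```

A score of 0.5 means the two groups are indistinguishable by guess rate, and a score closer to 1.0 means the suspected group is recognised far more reliably.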

The study also found that older OpenAI AI models, such as GPT-3.5 Turbo, showed lower content recognition than GPT-4o, but still high enough to be significant. However, the test did not indicate that GPT-4o mini was trained on the paywalled O'Reilly Media books. The paper states the reason could be that the test is not effective against smaller language models.
