Apple, Anthropic and Other AI Firms Have Reportedly Trained AI Models on Thousands of YouTube Videos

EleutherAI, a non-profit AI research lab, reportedly compiled the dataset that trained Apple and Anthropic’s AI models.

Highlights
  • The stolen YouTube data comes from Marques Brownlee, MrBeast, and more
  • Apple reportedly used this dataset to train its OpenELM AI model
  • YouTube prohibits accessing videos using any automated means
Data from Indian YouTube creators such as CarryMinati and Ashish Chanchlani was also reportedly swiped

Photo Credit: Reuters

Apple, Anthropic, and other major artificial intelligence (AI) firms have reportedly trained AI models on data from more than 170,000 YouTube videos. A new report claims that multiple AI companies used a publicly available dataset called the Pile, which included the plain text of the videos' subtitles without any video imagery. The data was collected from popular YouTube creators such as MrBeast, Marques Brownlee, and PewDiePie, as well as Indian YouTube creators such as CarryMinati, BB ki Vines, and Ashish Chanchlani.

Multiple AI Models Reportedly Trained on YouTube Videos

An investigation by Proof News found that subtitles from 173,536 YouTube videos, spanning more than 48,000 channels, were included in the dataset. As per the report, EleutherAI, a non-profit AI research lab, curated this dataset. Later, it was used by companies such as Apple, Anthropic, Nvidia, Salesforce, and more. Notably, the AI lab published a research paper highlighting the details of the dataset.

EleutherAI created an 800GB data repository dubbed the Pile and made it publicly available for those who wanted to train AI models but could not afford large datasets. The majority of the dataset was taken from publicly available sources such as English Wikipedia, e-books, and more. However, it also contained a dataset called YouTube Subtitles, which compiled the subtitles from these videos.

Based on the description in its research paper, the report claimed that the Pile was used to train Apple's OpenELM AI model. Research papers for AI models from Salesforce, Nvidia, and Anthropic also reportedly mention the use of the dataset.

Anthropic spokesperson Jennifer Martinez told the publication in a statement, “The Pile includes a very small subset of YouTube subtitles. YouTube's terms cover direct use of its platform, which is distinct from use of the Pile dataset. On the point about potential violations of YouTube's terms of service, we'd have to refer you to the Pile authors.”

Notably, YouTube's terms of service prohibit anyone from accessing the videos on the platform using automated means such as robots, botnets or scrapers. The YouTube Subtitles dataset would fall under the scraping category. A Google spokesperson told Proof News in an email response that the tech giant has taken “action over the years to prevent abusive, unauthorised scraping.” However, the spokesperson did not comment on AI firms' use of the data.

In a post on X (formerly known as Twitter), Marques Brownlee called out Apple for sourcing data from companies that included his videos' transcripts, but he also highlighted that the iPhone maker was not at fault since it did not collect the data itself.

While this dataset was collected and distributed publicly, there could be other instances of data scraping on platforms such as YouTube. With AI firms scrambling to find more data to train their large language models (LLMs), data procurement might continue to enter similar legally grey areas.


Akash Dutta

Akash Dutta is a Senior Sub Editor at Gadgets 360. He is particularly interested in the social impact of techn... more
