Internal Documents Reveal Nvidia’s Controversial Data Collection Practices
Internal Slack messages, emails, and documents obtained by 404 Media show that Nvidia has made extensive use of YouTube videos and other copyrighted material to compile training data for its artificial intelligence (AI) products. The leaked material suggests that Nvidia is scraping a staggering volume of content, roughly “80 years” of video per day, to train its AI models.
Asked about the legal and ethical implications of using copyrighted content, Nvidia said its practices “fully comply with the letter and spirit of copyright law.” Even so, internal communications reviewed by 404 Media indicate that managers dismissed concerns about the legality of using such datasets, saying the practice had been authorized at the highest levels of the company.
A former Nvidia employee, speaking anonymously, told 404 Media that the company instructed staff to scrape videos from platforms like Netflix and YouTube. The videos are intended to train Nvidia’s AI models, which are integral to projects such as the Omniverse 3D world generator, self-driving car systems, and the development of “digital humans.” The internal project, known as Cosmos, is distinct from Nvidia’s existing Cosmos deep learning product and has yet to be publicly released.
An email from the project leader outlined the ambitious goals of Cosmos, stating the intention to develop a cutting-edge video-based model that “integrates simulations of light transport, physics, and intelligence to unlock a range of downstream applications critical to Nvidia.” Ming-Yu Liu, Nvidia’s vice president of research and head of the Cosmos project, highlighted in a May email that the project aims to establish a data pipeline capable of generating a lifetime’s worth of visual training data daily.
The internal conversations and directives shed light on the significant challenges faced by Nvidia employees as they navigate legal and ethical dilemmas while developing technologies that drive the AI industry. The situation also underscores the broader issue of tech companies’ voracious demand for content to train their AI models, exemplified by industry leaders like Runway and OpenAI.
In response to the allegations, an Nvidia spokesperson reiterated that the company respects content creators’ rights and believes its research and models adhere to copyright law. The spokesperson emphasized that copyright law safeguards specific forms of expression, not facts, ideas, data, or information, and that fair use provisions cover the transformative use of such content, including for model training.
Asked about the use of YouTube videos, a Google spokesperson pointed to the company’s previous statements, citing an April 2024 Bloomberg article in which YouTube CEO Neal Mohan said that OpenAI using YouTube videos to train its models would be a “clear violation” of the platform’s terms of service.
Netflix also confirmed to 404 Media that it does not have a content use agreement with Nvidia, noting that its terms of service prohibit scraping.
The investigation highlights a troubling trend within the tech industry: a disregard for obtaining permission when harvesting substantial amounts of copyrighted content for training AI models. This practice raises significant questions about the ethical and legal boundaries of data usage in advancing AI technologies.