Nvidia is under scrutiny following revelations that its “Sora”-style video generation project, codenamed Cosmos and led by Vice President of Research Ming-Yu Liu, may have involved the unlawful acquisition of vast amounts of video data.
Internal documents leaked to 404 Media indicate that Nvidia employees were encouraged to scrape data from popular platforms such as YouTube and Netflix on a daily basis, with the goal of collecting, every single day, the visual equivalent of what a person would see over an entire lifetime.
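For a rough sense of scale, here is a minimal back-of-the-envelope sketch in Python; the 75-year lifespan and 16 waking hours per day are illustrative assumptions of ours, not figures taken from the leaked documents.

```python
# Illustrative estimate only -- the lifespan and waking-hours values below
# are assumptions for this sketch, not numbers from the leaked documents.
LIFESPAN_YEARS = 75          # assumed average human lifespan
WAKING_HOURS_PER_DAY = 16    # assumed hours of visual experience per day

lifetime_hours = LIFESPAN_YEARS * 365 * WAKING_HOURS_PER_DAY
print(f"~{lifetime_hours:,} hours of video per day")  # prints ~438,000 hours
```

Under these assumptions, the pipeline described in the documents would need to ingest on the order of several hundred thousand hours of video every day.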
Nvidia has responded firmly to these allegations, asserting that its data collection practices are entirely legal. According to the company, its methods adhere strictly to copyright law, which protects specific expressions rather than underlying facts or information. Nvidia contends that its use of such data falls within the bounds of fair use, particularly for transformative purposes such as model training.
The leaked documents, obtained by 404 Media, reveal that Nvidia’s Cosmos project aims to develop a cutting-edge video foundation model that integrates simulation of light transport, physics, and intelligence to enable a range of applications, including the Omniverse 3D world generator, autonomous driving systems, and digital human technologies.
Ming-Yu Liu, an esteemed IEEE Fellow who leads the Cosmos project, has overseen previous innovations such as NVIDIA Picasso (Edify), NVIDIA Canvas (GauGAN), and NVIDIA Maxine (LivePortrait). An email from May detailed the project’s ambition to create a video data pipeline capable of generating training data equivalent to a lifetime of human visual experience each day.
The leaked documents also included links to various video datasets, including MovieNet, WebVid, InternVid-10M, and proprietary video game footage. A former employee disclosed that Nvidia used yt-dlp, an open-source YouTube video downloader, and rotated IP addresses to bypass the platform’s download restrictions.
In response to inquiries from 404 Media, Nvidia reiterated its respect for content creators’ rights and maintained that its research and model training comply with copyright law.
Google pointed to remarks made earlier this year by YouTube CEO Neal Mohan, who said that using YouTube videos to train models like Sora would violate the platform’s terms of service. Netflix, meanwhile, confirmed that it did not authorize Nvidia to extract its content, citing restrictions in its own terms of service.
Interestingly, on the same day the Nvidia allegations surfaced, YouTube creators were pursuing a class-action lawsuit against OpenAI, alleging that the company used millions of YouTube videos to train its generative AI models without notifying or compensating the creators.
While data scraping by major tech companies is nothing new, original human-created data remains highly valuable. Previous research has shown that large models trained on original internet data tend to achieve better performance, and that model quality degrades as AI-generated content accounts for a growing share of available training data.