OpenAI’s recent unveiling of Sora, a dynamic video and interactive 3D environment generator, has set a groundbreaking benchmark in the realm of Generative Artificial Intelligence (GenAI). At the heart of this innovation lies the diffusion transformer, an AI model architecture that has quietly been evolving in the background for years.
The diffusion transformer, now gaining widespread attention, not only powers OpenAI’s Sora but also features prominently in Stability AI’s latest image generator, Stable Diffusion 3.0. This architectural breakthrough is poised to redefine GenAI capabilities, facilitating scalability beyond previous limits.
The journey to the diffusion transformer began in June 2022, when Saining Xie, a computer science professor at NYU, started a research project with William Peebles, now co-lead of Sora at OpenAI. Together they combined two machine learning concepts, diffusion and transformers, to create the diffusion transformer.
In essence, a diffusion model works by gradually adding noise to a piece of media, such as an image, until it becomes unrecognizable; the model then learns to remove that noise step by step, working its way toward a target output, such as a brand-new image. Traditional diffusion models rely on a U-Net backbone to learn that denoising step, but transformers have emerged as a more efficient, more scalable alternative.
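For readers who want to see the idea in code, here is a minimal, hypothetical sketch in PyTorch of a single diffusion training step in which a tiny transformer, rather than a U-Net, is asked to predict the noise that was added. The model, dimensions, and noise schedule are invented for illustration; a real DiT-style system would also condition on the timestep and use learned patch and positional embeddings.

```python
# Toy sketch: one diffusion training step with a transformer backbone.
# All sizes and the schedule are illustrative, not from any production model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiffusionTransformer(nn.Module):
    def __init__(self, patch_dim=48, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)           # patch -> token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, patch_dim)             # token -> predicted noise per patch

    def forward(self, noisy_patches):
        return self.head(self.backbone(self.embed(noisy_patches)))

# Noise schedule: alpha_bar[t] shrinks toward 0, so later timesteps are noisier.
T = 1000
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

model = TinyDiffusionTransformer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# One step on random "images" flattened into 64 patches of 48 values each.
x0 = torch.randn(8, 64, 48)                                   # clean data (batch, patches, patch_dim)
t = torch.randint(0, T, (8,))                                 # random timestep per sample
noise = torch.randn_like(x0)
a = alpha_bar[t].view(-1, 1, 1)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise                  # forward process: corrupt with noise

loss = F.mse_loss(model(x_t), noise)                          # learn to predict (and thus remove) the noise
loss.backward()
opt.step()
```

Trained over many such steps, the network learns to undo the corruption, and generation then amounts to running that denoising repeatedly from pure noise.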
Transformers, renowned for their role in complex reasoning tasks, have a distinctive feature known as the “attention mechanism”: for each piece of input data, the model weighs the relevance of every other piece of input and draws on it to produce the output. That design makes transformers simpler than many alternatives, highly parallelizable, and efficient to scale.
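As a rough illustration of what “weighing relevance” means, here is a hypothetical, stripped-down self-attention function in PyTorch; the tensor shapes and names are invented for the example and omit the multi-head projections used in practice.

```python
# Toy self-attention: each token scores every other token's relevance
# and returns a weighted mix of their values.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, tokens, dim)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # pairwise relevance scores
    weights = F.softmax(scores, dim=-1)                       # relevance weights sum to 1 per token
    return weights @ v                                        # weighted combination of values

x = torch.randn(1, 4, 8)                                      # 4 tokens, 8-dimensional each
out = scaled_dot_product_attention(x, x, x)                   # self-attention: tokens attend to one another
print(out.shape)                                              # torch.Size([1, 4, 8])
```

Because every token's output is computed independently from these weighted sums, the whole operation can run in parallel across tokens, which is a large part of why transformers scale so well.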
Xie explains, “What transformers contribute to the diffusion process is akin to an engine upgrade. The introduction of transformers … marks a significant leap in scalability and effectiveness.” This is evident in models like Sora, which harness vast volumes of video data and extensive model parameters to demonstrate what transformers can do at scale.
Despite the diffusion transformer concept existing for some time, its adoption in projects like Sora and Stable Diffusion took years to materialize. Xie attributes this delay to the recent realization of the crucial need for a scalable backbone model. “The Sora team really went above and beyond to show how much more you can do with this approach on a big scale,” he notes, affirming that transformers are now the preferred choice over U-Nets for diffusion models.
Xie envisions a future where diffusion transformers seamlessly integrate content understanding and creation. He emphasizes the necessity of standardizing architectures, with transformers emerging as ideal candidates for this purpose.
As Sora and Stable Diffusion 3.0 showcase the potential of diffusion transformers, the AI community anticipates a transformative journey into uncharted territories.