OpenAI recently gave us all a peek into its latest generative AI offering, Sora, and it was mind-blowing. Sora can create videos a minute long from just a text prompt, but what makes the tech so impressive is its ability to understand and simulate physics, which is why OpenAI characterises Sora as a ‘world simulator.’ Some of the videos the company has released to the public have to be seen to be believed.
Sora can generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background – all in videos with different resolutions and aspect ratios.
OpenAI says it is teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems that require real-world interaction.
“Unlike traditional AI models that rely on static representations, Sora introduces dynamic simulations. This allows it to simulate complex scenarios with a level of detail and realism previously unattainable. The ability to dynamically model and visualise scenarios sets Sora apart as a revolutionary advancement in artificial intelligence,” says Lakshmikant Gundavarapu, chief innovation officer at Tredence.
While Sora uses a transformer architecture similar to the one used in GPT models, Rahul Agarwalla, co-founder of SenseAI Ventures, notes that, interestingly, it departs from the pure diffusion-model construct used by most video generators such as Stable Diffusion, adopting instead a combined diffusion-plus-transformer architecture that OpenAI claims delivers a gain in performance. Sora’s diffusion models generate videos by starting with frames that look like static noise and gradually removing that noise over many steps.
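The noise-to-video loop described above can be sketched at toy scale. This is a minimal illustration of the reverse-diffusion idea only, not OpenAI's actual implementation: the `denoise_step` function here is a hand-written stand-in for what, in a real system, would be a learned denoising network, and the "video" is just a tiny 4-frame, 8×8 array.

```python
import numpy as np

def denoise_step(x, step, total_steps, rng):
    """One toy reverse-diffusion step: nudge the sample toward a
    'clean' target while shrinking the re-injected noise. A stand-in
    for a learned denoising model, not Sora's real network."""
    target = np.zeros_like(x)             # pretend the model predicts all-zero pixels
    alpha = 1.0 / (total_steps - step)    # blend more aggressively near the end
    noise_scale = (total_steps - step - 1) / total_steps
    x = (1 - alpha) * x + alpha * target
    return x + noise_scale * 0.01 * rng.standard_normal(x.shape)

def generate(shape=(4, 8, 8), total_steps=50, seed=0):
    """Start from pure static noise (a tiny 4-frame 'video') and
    iteratively denoise it over many steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)        # static-noise initialisation
    for step in range(total_steps):
        x = denoise_step(x, step, total_steps, rng)
    return x

video = generate()
print(video.shape)  # (4, 8, 8) — values end up close to the clean target
```

The key property mirrored here is that generation runs backwards: each step removes a little noise, so the sample converges toward coherent content over many iterations.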
“However, it still has issues with real world understanding. One of the videos shows a high-res monkey playing chess on a 7×7 board with three kings. We are not quite there yet, but boy are we making progress,” says Rahul.
OpenAI has itself cautioned that Sora, which has not yet been released to the public, still gets a lot of scenarios wrong, but the sheer breadth of complex scenarios that the model does get right is what has impressed fans and critics alike.
Many text-to-image models used to struggle to follow detailed image descriptions, often ignoring words or confusing the meaning of prompts. OpenAI addressed this by training its DALL-E 3 model on highly descriptive, machine-generated image captions. The same technique is what allows Sora, a text-to-video generator, to understand a wide array of highly descriptive scenarios: essentially, it has been shown a humongous number of videos paired with captions that describe those videos.
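The recaptioning idea above can be sketched as a tiny data-preparation step. This is purely illustrative: the `descriptive_caption` function below is a hypothetical stub standing in for a learned captioning model, and the file names and labels are invented for the example.

```python
def descriptive_caption(short_label: str) -> str:
    """Stand-in for a learned captioner that expands a terse label into
    a richly detailed scene description (an assumption, not a real API)."""
    return (f"A high-detail video of {short_label}, describing the subject, "
            f"its motion, the background, and the lighting conditions.")

# Raw data: terse human-style labels paired with video files (hypothetical).
raw_dataset = [
    ("a dog running on a beach", "dog_beach.mp4"),
    ("city traffic at night", "city_night.mp4"),
]

# Training pairs: each video is matched with a far more descriptive
# caption, which is what teaches the model to follow detailed prompts.
training_pairs = [(video, descriptive_caption(label))
                  for label, video in raw_dataset]

for video, caption in training_pairs:
    print(video, "->", caption)
```

The design point is that the richer the caption the model trains against, the better it learns to honour every clause of a long, specific prompt at generation time.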
Sagar PV, chief technology officer & head of technology & innovation group at Mindsprint, says that OpenAI is putting together parts of a larger puzzle that move it toward creating artificial general intelligence (AGI) – an AI system with the capabilities of an average human being. “With ChatGPT, Sora, investments towards creating autonomous AI agents, and the Whisper model for speech recognition, we aren’t far from the day when AGIs can do a multitude of human tasks. The release of Sora from that perspective is a significant leap towards creating a world that could in every sense of the word revolutionise economies, jobs, productivity and more, and brings us one step closer to the reality of AGI,” he says.
REAL-WORLD DISRUPTION
Nick Magnuson, head of AI at Qlik, says that we are likely to see meaningful productivity gains across many industries as organisations become more attuned to the potential of such technology. “Think of the time and effort required today to generate meaningful and high-quality video content. As we’ve seen with other forms of generative AI, it has two pronounced effects: makes the subject matter expert far more efficient and productive, while also lowering the technical barriers to those who can engage in such tasks.”
Nick expects advertising, filmmaking, gaming, and media & entertainment to be among the initial beneficiaries of such generative AI models.