Is Writing A Video With Text The Future?
How OpenAI's Sora stacks up against other text-to-video models
OpenAI is joining the field with its own text-to-video model, dubbed Sora.
If you're not familiar, text-to-video models are just that: generative AI models that take a detailed text description of the video you want and produce a video to match (Stable Video Diffusion is one example). In the world of AI-generated video, some big competition is coming.
With Sora coming out, let's take a look at how these text-to-video models stack up against each other.
Sora
OpenAI's Sora is the latest of these text-to-video models. Built on a diffusion model, Sora generates high-definition video clips up to one minute long, featuring complex scenes, multiple characters, and intricate details of the environment.
Sora understands not only the user's prompt but also the underlying physics and spatial dynamics of the scene. This enables the model to create lifelike, imaginative videos that adhere closely to the user's instructions and capture nuanced emotion.
One of the key features of the Sora model is its ability to maintain subject consistency, even when the subject momentarily disappears from view. This is a significant advancement in AI video generation, as it allows for a more realistic and seamless viewing experience.
To create videos, Sora uses a transformer architecture similar to GPT models. Images and videos are represented as patches, which allows the model to be trained on a wide range of data with different durations, resolutions, and aspect ratios. The model also leverages recaptioning techniques from DALL·E 3, ensuring that the generated videos closely follow the user's text instructions.
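To make the patch idea concrete, here is a minimal Python sketch of how a video tensor can be cut into flattened spacetime "tokens" for a transformer. OpenAI has not published Sora's actual code, so the function name, patch sizes, and tensor layout below are all illustrative assumptions.

```python
import numpy as np

def video_to_patches(video, patch_t=4, patch_h=16, patch_w=16):
    """Split a video of shape (frames, height, width, channels)
    into flattened spacetime patches, one row per patch.
    Illustrative sketch -- not OpenAI's implementation."""
    T, H, W, C = video.shape
    # Trim so every dimension divides evenly into patches.
    T, H, W = T - T % patch_t, H - H % patch_h, W - W % patch_w
    video = video[:T, :H, :W]
    patches = (
        video.reshape(T // patch_t, patch_t,
                      H // patch_h, patch_h,
                      W // patch_w, patch_w, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)  # gather the patch-grid axes first
             .reshape(-1, patch_t * patch_h * patch_w * C)
    )
    return patches  # (num_patches, patch_dim): a token sequence for a transformer

# A 1-second 240p clip at 16 fps becomes a sequence of 1,200 tokens:
clip = np.random.rand(16, 240, 320, 3).astype(np.float32)
print(video_to_patches(clip).shape)  # (1200, 3072) with these patch sizes
```

Because any clip, whatever its duration or aspect ratio, reduces to a sequence of such patches, a single transformer can be trained on heterogeneous video data; that flexibility is what the patch representation buys.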
Gen-2
Runway's Gen-2 is a text-to-video platform that can produce realistic, cinematic videos based on user prompts. It offers various AI-powered features as a subscription service available to anyone who wants to create or enhance video content. Sora, by contrast, is a diffusion model: it generates a video by starting from one that looks like static noise and gradually transforming it, removing the noise over many steps. It can create lifelike scenes and stories from text input and is capable of generating videos in minutes.
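For intuition, here is a heavily simplified Python sketch of that denoising loop: start from pure static noise and repeatedly subtract a predicted noise estimate. The `predict_noise` function is a hypothetical stand-in for the trained network; real diffusion systems use learned noise schedules and far more sophisticated update rules.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_noise(frames, step):
    """Hypothetical placeholder for the trained denoising network."""
    return rng.standard_normal(frames.shape) * 0.01

def generate(shape=(16, 64, 64, 3), steps=50):
    frames = rng.standard_normal(shape)   # begin as pure static noise
    for step in reversed(range(steps)):   # walk the noise schedule backwards
        frames = frames - predict_noise(frames, step)  # peel away a little noise
    return frames                         # the "denoised" video tensor

video = generate()
print(video.shape)  # (16, 64, 64, 3): frames, height, width, channels
```

With a trained network in place of the placeholder, each subtraction nudges the clip from noise toward something that matches the text prompt.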
In a head-to-head comparison, Sora appears to have the edge over Gen-2 in video quality and realism, as well as in generating longer videos that stay coherent with the prompt. However, Gen-2 has the advantage of being available to anyone who wants to use it, while Sora is currently available only to select individuals chosen by OpenAI.
Google Lumiere
Lumiere is a text-to-video AI model developed by Google and announced in early 2024. It is capable of generating realistic and diverse video content from text prompts, and it can also perform tasks such as video editing, creating videos from images or other videos, and extending or combining video elements.
When comparing Lumiere to Sora and Gen-2, each model's strengths and weaknesses matter. Lumiere is a powerful tool for generating realistic video content, but it is limited to lower-resolution output (512 × 512 pixels) and shorter clips (around 5 seconds).
Sora and Gen-2, by contrast, offer higher-resolution output (up to 1920 × 1080 pixels for Sora and 768 × 768 pixels for Gen-2) and longer videos (up to 60 seconds for Sora and 14 seconds for Gen-2), along with capabilities such as video editing and combining video elements.
In terms of video quality, both Sora and Gen-2 produce more dynamic and visually appealing content than Lumiere. However, all three models can hallucinate, meaning the generated videos are not always accurate or physically plausible.
Meta’s Make-A-Video
Make-A-Video is a text-to-video model developed by Meta AI. It allows users to generate videos from text prompts, making video creation accessible to a wider audience.
Make-A-Video can produce high-quality output, but quality depends heavily on the text prompt and on the training data behind the model, so results may vary compared with its competitors.
Make-A-Video focuses squarely on generating videos from text prompts, while models like Sora and Gen-2 offer additional features such as video editing, combining video elements, and extending or shortening videos. Lumiere, for its part, is known for generating realistic and diverse video content.
In conclusion, although Sora, OpenAI's text-to-video model, has not yet been publicly released, it appears to beat the competition from what we have seen so far.
This Week in AI
This week in AI news: OpenAI's ChatGPT has gained support for cross-chat memory. This means ChatGPT will be able to remember and reference anything you have previously talked about with it.
Sensability.AI is a weekly newsletter from the team at Cappital.co that keeps you up to date on the ever-changing landscape of AI.
Is there a tool you want us to look into? Let us know.