AI Generated Text-to-Video With OpenAI Sora Is Stunningly Realistic

Vincent T.
High-Definition Pro
9 min read · Feb 22, 2024


AI (Artificial Intelligence) continues to fascinate, as OpenAI introduces a text-to-video tool called Sora. The results are mind-bogglingly stunning and truly impressive. It is like ChatGPT, but for generating videos.

If you watched the demos, you would not have known they were AI generated until someone told you. They look so real that they seem to have been filmed with a real camera and real people behind the scenes.

Technology like this has implications in the real world. It is going to be disruptive to existing industries, particularly content creation and production.

OpenAI Sora

Sora is a text-to-video generation tool that has grabbed headlines for its ability to create realistic and complex videos from user prompts. If you have seen one of the demo videos, you might not even realize it was AI generated.

According to a source, OpenAI researchers Tim Brooks and Bill Peebles chose the name “Sora” (which means “sky” in Japanese) because it “evokes the idea of limitless creative potential.” In other words, “the sky is the limit” when it comes to the possibilities available with this technology.

Sora uses a diffusion model to generate videos, starting with a raw, noisy version and then refining the content by removing the noise over several steps.

Figure 1. The layers of transformation of the raw image from noisy to enhanced output. Sora performs this process frame-by-frame.
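The denoising process the figure describes can be sketched in a few lines. This is a toy illustration, not Sora's actual network: the `denoise_step` helper and the known "clean" frame are placeholders for what a trained denoising model would predict from data.

```python
import numpy as np

# Toy illustration of diffusion-style generation: start from pure noise
# and remove it over several steps. A known "clean" frame stands in for
# what a trained denoising network would predict at each step.

def denoise_step(frame, clean, step, total_steps):
    # Blend a little further toward the clean frame each step;
    # by the final step all of the noise has been removed.
    alpha = (step + 1) / total_steps
    return (1 - alpha) * frame + alpha * clean

rng = np.random.default_rng(0)
clean = rng.uniform(0, 1, size=(8, 8))   # hypothetical clean frame
frame = rng.normal(0, 1, size=(8, 8))    # raw, noisy starting point

total_steps = 10
for step in range(total_steps):
    frame = denoise_step(frame, clean, step, total_steps)

# After the final step the noise is gone and the frame matches the target.
print(np.abs(frame - clean).max())
```

In the real model the "clean" target is not known in advance; the network learns to predict it from the text prompt and its training data.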

From OpenAI's research:

This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.
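The compression described above can be illustrated with a toy encoder/decoder. Average pooling and nearest-neighbour upsampling stand in for the learned networks, and the compression factors are assumptions for illustration only.

```python
import numpy as np

# Sketch of compressing video into a latent space that is smaller in both
# time and space, then decoding back to pixel space. The real encoder and
# decoder are trained neural networks; these factors are illustrative.

T_FACTOR, S_FACTOR = 4, 8   # hypothetical temporal and spatial factors

def encode(video):
    t, h, w = video.shape
    # Average pooling over time and space stands in for a learned encoder.
    return video.reshape(t // T_FACTOR, T_FACTOR,
                         h // S_FACTOR, S_FACTOR,
                         w // S_FACTOR, S_FACTOR).mean(axis=(1, 3, 5))

def decode(latent):
    # Nearest-neighbour upsampling stands in for the learned decoder.
    return (latent.repeat(T_FACTOR, axis=0)
                  .repeat(S_FACTOR, axis=1)
                  .repeat(S_FACTOR, axis=2))

video = np.ones((16, 64, 64))   # 16 frames of 64x64 "pixels"
latent = encode(video)
print(latent.shape)             # (4, 8, 8): compressed in time and space
restored = decode(latent)
print(restored.shape)           # (16, 64, 64): back to pixel space
```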

Sora combines a diffusion model with transformers. Like DALL-E 3, it starts with noise and iteratively refines it into the desired output, frame by frame in Sora's case. It then leverages a powerful transformer architecture, of the kind used in language models like GPT-3, for efficient processing and scalability.

Sora also uses a patch-based representation: video and image data are broken into smaller patches, enabling the model to handle diverse video formats and resolutions.
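A minimal sketch of what patchifying a video could look like. The patch sizes here are assumptions for illustration; the real model's patch dimensions and learned embeddings have not been published.

```python
import numpy as np

# Sketch of a patch-based representation: a video tensor is cut into small
# spacetime patches that become the "tokens" a transformer processes, much
# as words are tokens in a language model. Patch sizes are assumptions.

PT, PH, PW = 2, 16, 16   # hypothetical patch: 2 frames x 16x16 pixels

def to_patches(video):
    t, h, w, c = video.shape
    patches = video.reshape(t // PT, PT, h // PH, PH, w // PW, PW, c)
    # Group the patch-index axes together, then flatten each patch.
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)
    return patches.reshape(-1, PT * PH * PW * c)  # one row per patch token

video = np.zeros((8, 64, 64, 3))   # 8 frames, 64x64, RGB
tokens = to_patches(video)
print(tokens.shape)                # (64, 1536): 4*4*4 tokens of 2*16*16*3 values
```

Because any video that divides evenly into patches yields a token sequence, the same model can handle different durations, resolutions, and aspect ratios.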

OpenAI has not yet disclosed the specific details of the training data used for Sora (as of this posting). However, it is likely a massive dataset of text-video pairs encompassing various scenarios, actions, and environments, analogous to the large text corpora used to train LLMs (Large Language Models).

According to OpenAI:

In addition to being able to generate a video solely from text instructions, the model is able to take an existing still image and generate a video from it, animating the image’s contents with accuracy and attention to small detail. The model can also take an existing video and extend it or fill in missing frames.

Figure 2. Prompt: The story of a robot’s life in a cyberpunk setting.

Here is an example of how the model works. Sora created a story from what appears to be a simple prompt. Figure 2 demonstrates how Sora can create a visualization that represents its interpretation of the world based on the context provided (i.e. a cyberpunk setting).

Stunning Results

The videos in OpenAI's demos look quite realistic and captivating. They really look as if they were shot by a crew and edited by a human expert.

Figure 3. Prompt: Tour of an art gallery with many beautiful works of art in different styles.

In Figure 3 the lighting looks spot-on. The colors look natural in the scene, appearing to be styled by a professional and shot with the latest high-resolution cameras. It looks like a first-person view (FPV) of touring an art gallery.

It is unlike previous AI-generated videos. What makes Sora different is its ability to generate complex scenes containing multiple characters, specific types of motion, and accurate details. It takes what the user types in the prompt and generates a result that far exceeds expectations.

Figure 4. Prompt: A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.

In Figure 4 we see a demo that was one of the most convincing in terms of cinematic effects. If I had not known it was AI, I would have thought it was a real trailer for an upcoming big-budget film production.

Who is the actor, though? Is this really AI, or are they faking it? It looks too real to be computer generated. If genuine, it is an example of how far AI algorithms have progressed in the last few years.

When a user enters what they want to generate in the prompt, Sora takes the words from the text and maps them to its understanding of the physical world. Based on that data, it then assembles the pieces and renders a video on the display.

Figure 5. Prompt: A Chinese Lunar New Year celebration video with Chinese Dragon.

This puts creative control in the hands of the AI algorithm that generates the video. All a user has to do is type in something they want to generate, and like magic it is rendered. The demo in Figure 5 is an example of that.

Sora was initially available only for trial runs with select users (i.e. red teamers), so the demos were all created by OpenAI. It will eventually be available in some form to the general public.

The End Of Traditional Filmmaking?

AI-powered video generation like OpenAI Sora presents challenges. It can affect the demand for human actors, producers, directors, and other staff needed for films.

A creative director could generate everything they need using an advanced form of Sora to make their own movie. It cuts costs and complexities in production, all done while sitting in front of a computer. You now have your “armchair filmmakers”.

Figure 6. Prompt: The camera rotates around a large stack of vintage televisions all showing different programs — 1950s sci-fi movies, horror movies, news, static, a 1970s sitcom, etc, set inside a large New York museum gallery.

The demo in Figure 6 would ordinarily require a big-budget production with vintage televisions. Doing it with AI significantly cuts the production costs and labor required to set up such a concept (e.g. sourcing many televisions, wiring, location scouting, etc.).

This does not mean the end of traditional filmmaking; rather, it opens the door to exciting new possibilities. There are still aspects that are best filmed the traditional way, since it gives the director more creative control.

Figure 7. Prompt: A grandmother with neatly combed grey hair stands behind a colorful birthday cake with numerous candles at a wood dining room table, expression is one of pure joy and happiness, with a happy glow in her eye. She leans forward and blows out the candles with a gentle puff, the cake has pink frosting and sprinkles and the candles cease to flicker, the grandmother wears a light blue blouse adorned with floral patterns, several happy friends and family sitting at the table can be seen celebrating, out of focus. The scene is beautifully captured, cinematic, showing a 3/4 view of the grandmother and the dining room. Warm color tones and soft lighting enhance the mood.

When it comes to simulating complex interactions between objects and multiple characters, Sora is not quite perfect. In Figure 7 we see an example of this, using a more descriptive prompt. While we do see what appear to be real humans emoting and having fun, it appears unnatural at times (e.g. hand gestures, body position, facial expressions, etc.).

The strength of filmmaking still lies in the storytelling, emotional connection, performance of the actors/actresses, and a portrayal of human experiences that many can relate to. AI cannot yet replicate those aspects, though it might be good at generating scenes based on its knowledge of reality.

Filmmaking is also about working with people, creating images with a camera, writing original scripts, and editing how you want your scenes in post. These are things that AI cannot totally replace.

Where AI Video Could Be An Option

One of the best applications of AI generated videos is for special effects or SFX. It can be used in combination with video editing tools to create backdrops, extra characters on set, virtual objects, fantasy themes, and simulations.

AI can also offer safer ways to film a scene, or create what would otherwise be impossible. This option helps filmmakers produce more amazing scenes without putting any human lives at risk.

Figure 8. Prompt: Basketball through hoop then explodes.

In Figure 8, it makes sense to create this as a visual effect rather than producing it physically. It lessens liability for the producer, meaning no harm to people and no destruction of property. Buying a basketball hoop just to blow it up would cost far more than generating the effect by computer.

SFX tools are already widely used in video editing. There are editors around the world working in studios using software that performs these functions, but AI can add further enhancements or provide new solutions to certain issues.

A drawback to using this type of AI is the lack of full creative control. You are essentially letting Sora render its own interpretation of your prompt, based on the data it was trained on.

It is best to embrace this technology as a partner tool. It can lead to a future where AI enhances the storytelling potential for traditional filmmakers. In other words, let it be an option available for anyone to use.

Potential For Its Use

Here are some more use cases where Sora could be ideal.

  • Democratizing Video Production: This technology allows anyone, regardless of technical expertise, to produce engaging and informative videos. Content creators on social media and video sharing apps will benefit.
  • Content Creation Efficiency: Scripting a video is often faster than filming and editing, making text-to-video ideal for fast-paced content creation. Users can visualize what they want to produce, which can spark insightful ideas.
  • Visual Presentations: This can be used to convey ideas in meetings, conferences, seminars, webinars, and other presentations. A speaker can direct the audience’s attention to a video that presents their ideas.
  • Educational Applications: Complex concepts can be explained more effectively through visually engaging videos generated from text. Teachers can present their views through visualizations that not only educate, but can also entertain and be more engaging to students.

Conclusion

OpenAI Sora represents a significant advancement in text-to-video generation, with impressive capabilities and underlying technical innovations. As the technology matures, certain considerations will need to be addressed.

If AI-generated video becomes mainstream, anyone can become the next filmmaker on a low budget. There is no need for actors, expensive locations, or video equipment when it can all be generated digitally with a computer using AI software.

It is not perfect, though, because quirks and imperfections are sometimes noticeable. As learning models get bigger and more accurate in representing reality, things are going to improve.

There are concerns that it could add to the harmful content already littering the Internet. It could also be used to spread hate, misinformation, and disinformation, and to create deepfakes that harm the reputations of others.

A bigger concern is its potential to replace humans in every aspect of production. That would affect the talent and creators who are a big part of traditional filmmaking.

While Sora has the potential to revolutionize video production, it will likely need to be regulated in some way to ensure that negative effects are minimized or prevented.

Much discussion will be needed around ethical considerations and future development. What impact this will have, and how it will change the entertainment and creative industries, remains to be seen.

Disclaimer: This opinion is not based on hands-on experience, but on reports regarding the product. Please always DYOR (do your own research) to verify information and form your own perspective.
