Member-only story

🤯 How Damn Hard Is Text-To-Video 🎬?

3 min readMay 22, 2023

Stable Diffusion has significantly simplified the process of text-to-image generation, achieving impressive results. However, text-to-video generation has yet to reach the same level of advancement.

Source: Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators (March, 2023)

In a recent paper titled “Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models”, AI researchers from Carnegie Mellon University presented an innovative approach to address this.

The challenge

Generating videos that aligns with textual descriptions, capture motion, and maintain consistency poses unique complexities. Researchers are actively exploring techniques to overcome these hurdles and unlock the potential of text-to-video generation.

The Text2Video-Zero model, developed by Google AI and released in April 2023, represents the state-of-the-art (SOTA) in text-to-video generation. Text2Video-Zero is a zero-shot model, which means that it can generate videos from text descriptions without any additional training data.

Here’s an experiment by the authors with this model:

🤯 How Damn Hard Is Text-To-Video 🎬?

The challenge

Written by Angelina Yang

No responses yet