🤯 How Damn Hard Is Text-To-Video 🎬?
Stable Diffusion has significantly simplified the process of text-to-image generation, achieving impressive results. However, text-to-video generation has yet to reach the same level of advancement.
In a recent paper titled “Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models”, AI researchers from Carnegie Mellon University presented an innovative approach to address this gap.
The challenge
Generating videos that align with textual descriptions, capture motion, and maintain frame-to-frame consistency poses unique complexities. Researchers are actively exploring techniques to overcome these hurdles and unlock the potential of text-to-video generation.
The Text2Video-Zero model, developed by Picsart AI Research and released in March 2023, represents the state-of-the-art (SOTA) in text-to-video generation. Text2Video-Zero is a zero-shot model: it repurposes a pretrained text-to-image diffusion model (Stable Diffusion) to generate videos from text descriptions without any video training data.
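To get a feel for how lightweight zero-shot generation is in practice, here is a minimal sketch of running Text2Video-Zero through the Hugging Face diffusers `TextToVideoZeroPipeline`. The checkpoint, prompt, and output settings are illustrative choices, and a recent diffusers release plus a CUDA GPU are assumed:

```python
import torch
import imageio
from diffusers import TextToVideoZeroPipeline

# Text2Video-Zero reuses a pretrained text-to-image checkpoint;
# no video training data is involved at any point.
model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

prompt = "A horse galloping on a street"  # illustrative prompt
result = pipe(prompt=prompt).images  # list of frames as float arrays in [0, 1]

# Convert frames to uint8 and write them out as a short clip.
frames = [(frame * 255).astype("uint8") for frame in result]
imageio.mimsave("video.mp4", frames, fps=4)
```

Because the pipeline only adds cross-frame attention and latent motion dynamics on top of a frozen image checkpoint, a short clip comes out of a single generation call, with no fine-tuning step.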
Here’s an experiment by the authors with this model: