🤯 How Damn Hard Is Text-To-Video 🎬?

Angelina Yang
3 min read · May 22

Stable Diffusion has significantly simplified the process of text-to-image generation, achieving impressive results. However, text-to-video generation has yet to reach the same level of advancement.

Source: Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators (March, 2023)

In a recent paper titled “Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models”, AI researchers from Carnegie Mellon University presented an innovative approach to narrowing this gap.

The challenge

Generating videos that align with textual descriptions, capture motion, and stay consistent from frame to frame poses unique complexities. Researchers are actively exploring techniques to overcome these hurdles and unlock the potential of text-to-video generation.

The Text2Video-Zero model, developed by Picsart AI Research (PAIR) with academic collaborators and published in March 2023, was a state-of-the-art (SOTA) approach to text-to-video generation at the time. Text2Video-Zero is a zero-shot model: it reuses a pre-trained text-to-image diffusion model such as Stable Diffusion to generate videos from text descriptions, without any training on video data.

Here’s an experiment by the authors with this model:


The authors wanted to see the motion of walking from left to right on the beach, so they explicitly added “from left to right” to the prompt.


The resulting frames still failed to capture the desired motion accurately. Two issues may be at play:

  • The phrase “from left to right” may not be interpreted correctly, since a zero-shot model built on a text-to-image backbone was never trained on video and may not associate the phrase with a direction of motion.
  • The frames are zooming in on the person’s feet rather than the full body, showing misalignment with the desired motion and visual consistency.
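
To make "failed to capture the desired motion" a bit more concrete, here is a minimal sketch of one way you could check for left-to-right motion in a sequence of frames. It is not from the STF paper. It assumes you already have a binary subject mask per frame (for example from a segmentation model, which is not shown), and the function names horizontal_centroid, moves_left_to_right, and make_frame are hypothetical helpers made up for this illustration.

```python
from typing import List, Optional, Sequence

# A mask is a grid of 0/1 values; 1 marks pixels that belong to the subject.
# ASSUMPTION: masks come from some external segmentation step (not shown).
Mask = Sequence[Sequence[int]]


def horizontal_centroid(mask: Mask) -> Optional[float]:
    """Mean column index of subject pixels, or None if the mask is empty."""
    total = 0
    count = 0
    for row in mask:
        for x, value in enumerate(row):
            if value:
                total += x
                count += 1
    return total / count if count else None


def moves_left_to_right(masks: Sequence[Mask], min_shift: float = 1.0) -> bool:
    """True if the centroid never moves left and shifts right by >= min_shift overall."""
    centroids = [horizontal_centroid(m) for m in masks]
    if len(centroids) < 2 or any(c is None for c in centroids):
        return False  # subject missing in some frame, or too few frames to judge
    steps = [b - a for a, b in zip(centroids, centroids[1:])]
    return all(s >= 0 for s in steps) and (centroids[-1] - centroids[0]) >= min_shift


def make_frame(width: int, height: int, left: int, size: int = 2) -> List[List[int]]:
    """Toy frame: a vertical bar `size` columns wide with its left edge at `left`."""
    return [[1 if left <= x < left + size else 0 for x in range(width)]
            for _ in range(height)]


# Toy data: one clip where the bar walks across the frame, one where it stays put.
walking = [make_frame(8, 4, left) for left in (0, 2, 4, 6)]
static = [make_frame(8, 4, 3) for _ in range(4)]

print(moves_left_to_right(walking))
print(moves_left_to_right(static))
```

A centroid check like this only looks at where the subject is, not how much of it is visible, so it would not flag the second issue above (frames zooming in on the feet). Tracking the mask's area or bounding box over time would be a natural companion check.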

The proposal

The authors used the Text2Video-Zero model as a baseline and proposed incorporating sketches as additional inputs, the core idea of the STF paper, to introduce a degree of “control” over the generated video. This approach resembles enhancing…
