Long Video Generation with
Time-Agnostic VQGAN and Time-Sensitive Transformer

Songwei Ge1 *
Thomas Hayes2
Harry Yang2
Xi Yin2
Guan Pang2
David Jacobs1
Jia-Bin Huang1,2
Devi Parikh2,3

1 University of Maryland, College Park
2 Meta AI
3 Georgia Tech

* Work done primarily during an internship at Meta AI.



Videos are created to express emotion, exchange information, and share experiences. Video synthesis has intrigued researchers for a long time. Despite the rapid progress driven by advances in visual synthesis, most existing studies focus on improving the frames' quality and the transitions between them, while little progress has been made in generating longer videos. In this paper, we present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames. Our evaluation shows that our model trained on 16-frame video clips from standard benchmarks such as UCF-101, Sky Time-lapse, and Taichi-HD datasets can generate diverse, coherent, and high-quality long videos. We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.


Unconditional / Class-Conditional Long Video Generation Results

Randomly selected results of different methods on different datasets.
Each video contains 1024 frames at 128x128 resolution and is played at 8 fps for the first 2 seconds and at 32 fps afterward.
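
As a quick check on the playback schedule described here, the on-screen duration of one clip can be worked out directly. This is a minimal sketch: the function name and its arguments are illustrative, and it assumes the slow segment covers exactly 2 x 8 = 16 frames, with the remaining frames played at the faster rate.

```python
def playback_seconds(total_frames=1024, slow_fps=8, slow_seconds=2, fast_fps=32):
    """Duration of a clip played slowly at first, then faster (assumed schedule)."""
    # Frames consumed during the slow segment (capped by the clip length).
    slow_frames = min(total_frames, slow_fps * slow_seconds)
    # Everything else is played at the fast rate.
    fast_frames = total_frames - slow_frames
    return slow_frames / slow_fps + fast_frames / fast_fps


duration = playback_seconds()
print(duration)  # 33.5 (2 s for the first 16 frames + 31.5 s for the other 1008)
```

Under these assumptions, each 1024-frame result plays for about 33.5 seconds.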


Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh
Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer
In Submission

Conditional Long Video Generation Results

Text- and audio-conditioned video generation results with 1024 frames on the MUGEN (fps=15, resolution=256x256) and AudioSet-Drum (fps=30, resolution=64x64) datasets.

In each pair, left: real video; right: synthetic video.

Diverse Short Video Generation Results

Randomly selected 16-frame videos with 128x128 resolution and 8 fps generated by TATS-base on different datasets.

Datasets: UCF-101, Sky Time-lapse, Taichi-HD

Text-Conditioned Video Manipulation

Real videos from the MUGEN dataset and manipulated videos generated by conditioning on the modified texts and the first 4 frames.

Mugen jumps down, and killed by / jumps over a bee
Mugen jumps up to the left to a platform, and killed by / kills a Slimeblock
Mugen jumps down to the left to a platform and killed by / jumps over a Slimeblue
Mugen walks to the right, and jumps to the right and kills / killed by a Wormpink

Ablation experiment on the VQGAN padding

Randomly selected videos with 1024 frames and 128x128 resolution generated by Vanilla Video VQGAN with zero padding.


Sky Time-lapse


High-Resolution Long Video Generation Results

High-resolution (256x256), long (1024-frame) video generation results on the Sky Time-lapse dataset (fps=15), generated by applying a sliding window both spatially and temporally.
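
The page does not describe how the sliding window is implemented, so the following is only a generic sketch of the temporal half of the idea: an autoregressive sampler that conditions each new item on a fixed-size window of the most recent items, so the sequence can grow well past the window length. The names `sliding_window_generate` and `next_item` are hypothetical, the toy predictor stands in for a trained model, and none of this is the authors' code.

```python
from typing import Callable, List, Optional, Sequence


def sliding_window_generate(
    next_item: Callable[[List[int]], int],
    total_len: int,
    window: int,
    seed: Optional[Sequence[int]] = None,
) -> List[int]:
    """Generate `total_len` items, each conditioned on at most `window - 1` past items."""
    seq = list(seed) if seed else []
    while len(seq) < total_len:
        # Keep only the most recent items so the context plus the new item fits the window.
        context = seq[-(window - 1):] if window > 1 else []
        seq.append(next_item(context))
    return seq


# Toy stand-in for a trained model: continue counting from the last item seen.
def toy_predictor(context: List[int]) -> int:
    return context[-1] + 1 if context else 0


# A window of 16 mirrors the 16-frame training clips mentioned in the abstract.
long_sequence = sliding_window_generate(toy_predictor, total_len=1024, window=16)
print(len(long_sequence))  # 1024
```

The design point is that the predictor never sees more than a bounded context, which is what lets a model trained on short clips be rolled out to much longer sequences; a spatial sliding window would apply the same idea along the height and width axes.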


  title     = {Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer},
  author    = {Ge, Songwei and Hayes, Thomas and Yang, Harry and Yin, Xi and Pang, Guan and Jacobs, David and Huang, Jia-Bin and Parikh, Devi},
  journal   = {arXiv preprint arXiv:2204.03638},
  year      = {2022}