Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer

Songwei Ge¹*

Thomas Hayes²

Harry Yang²

Xi Yin²

Guan Pang²

David Jacobs¹

Jia-Bin Huang¹,²

Devi Parikh²,³

¹ University of Maryland, College Park

² Meta AI

³ Georgia Tech

* Work done primarily during an internship at Meta AI.

Abstract

Videos are created to express emotion, exchange information, and share experiences. Video synthesis has intrigued researchers for a long time. Despite the rapid progress driven by advances in visual synthesis, most existing studies focus on improving the frames' quality and the transitions between them, while little progress has been made in generating longer videos. In this paper, we present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames. Our evaluation shows that our model trained on 16-frame video clips from standard benchmarks such as UCF-101, Sky Time-lapse, and Taichi-HD datasets can generate diverse, coherent, and high-quality long videos. We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.

Unconditional / Class-Conditional Long Video Generation Results

Randomly selected results of different methods on different datasets.

Each video contains 1024 frames at 128x128 resolution and plays at 8 fps for the first 2 seconds and at 32 fps afterward.
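As a quick check on the playback schedule described above, the arithmetic can be sketched as follows. The function name and its parameters are illustrative and do not come from the page; the sketch assumes the switch from slow to fast playback happens exactly at the 2-second mark.

```python
def playback_seconds(total_frames=1024, slow_fps=8, slow_seconds=2, fast_fps=32):
    """Return the wall-clock length of a clip played slowly first, then fast."""
    # Frames consumed during the slow segment (capped by the clip length).
    slow_frames = min(total_frames, slow_fps * slow_seconds)
    # Everything that remains is played at the faster rate.
    fast_frames = total_frames - slow_frames
    return slow_frames / slow_fps + fast_frames / fast_fps


print(playback_seconds())  # 33.5
```

Under these assumptions, the first 16 frames take 2 seconds and the remaining 1008 frames take 31.5 seconds, so each 1024-frame clip lasts about 33.5 seconds.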

Dataset: UCF-101 (Soomro et al., 2012)

Paper status: in submission. Code: coming soon.

Conditional Long Video Generation Results

Text- and audio-conditioned video generation results with 1024 frames on the MUGEN (fps=15, resolution=256x256) and AudioSet-Drum (fps=30, resolution=64x64) datasets.


In each pair: left, real video; right, synthetic video.

Diverse Short Video Generation Results

Randomly selected 16-frame videos at 128x128 resolution and 8 fps, generated by TATS-base on different datasets.

UCF-101	Sky Time-lapse	Taichi-HD

Text-Conditioned Video Manipulation

Real videos from the MUGEN dataset and manipulated videos generated by conditioning on the modified texts and the first 4 frames.


Mugen jumps down, and killed by / jumps over a bee		Mugen jumps up to the left to a platform, and killed by / kills a Slimeblock


Mugen jumps down to the left to a platform and killed by / jumps over a Slimeblue		Mugen walks to the right, and jumps to the right and kills / killed by a Wormpink

Ablation experiment on the VQGAN padding

Randomly selected videos with 1024 frames at 128x128 resolution, generated by a vanilla video VQGAN with zero padding.
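One plausible way to see why the padding choice matters for a model meant to be time-agnostic is a toy temporal filter. The sketch below is not the VQGAN itself: it runs a 3-tap sliding-window sum (a cross-correlation, as in deep-learning convolution layers) over a perfectly static sequence. The function name is hypothetical, and replicate padding is used here only as an assumed alternative for contrast; this page names zero padding but does not state what replaces it.

```python
def conv1d_time(signal, kernel, mode):
    """Same-length 1D sliding-window filter along time (odd-length kernel)."""
    pad = len(kernel) // 2
    if mode == "zero":
        padded = [0.0] * pad + list(signal) + [0.0] * pad
    elif mode == "replicate":
        # Assumed alternative: repeat the first and last values at the borders.
        padded = [signal[0]] * pad + list(signal) + [signal[-1]] * pad
    else:
        raise ValueError(f"unknown padding mode: {mode}")
    k = len(kernel)
    return [sum(padded[t + i] * kernel[i] for i in range(k))
            for t in range(len(signal))]


frames = [1.0] * 6          # a static "video": one identical value per frame
kernel = [1.0, 1.0, 1.0]    # simple 3-tap temporal filter

print(conv1d_time(frames, kernel, "zero"))       # [2.0, 3.0, 3.0, 3.0, 3.0, 2.0]
print(conv1d_time(frames, kernel, "replicate"))  # [3.0, 3.0, 3.0, 3.0, 3.0, 3.0]
```

With zero padding, the first and last outputs differ from the interior even though the input never changes, so the output leaks the position within the clip. With replicate padding, the output is the same at every time step.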

UCF-101

Sky Time-lapse

Taichi-HD

High-Resolution Long Video Generation Results

High-resolution (256x256), long (1024-frame) video generation results on the Sky Time-lapse dataset (fps=15), obtained by applying a sliding window both spatially and temporally.
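The temporal half of the sliding-window idea can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: `generate_long` and `toy_sampler` are hypothetical names, each item stands in for a whole frame (a real model would more likely operate on latent tokens), and the spatial sliding window is not shown. The window of 16 mirrors the 16-frame training clips mentioned in the abstract.

```python
import random


def generate_long(sample_next, num_frames, window=16, seed=0):
    """Extend a sequence one item at a time, conditioning each step on at most
    the last `window - 1` items so context plus the new item fits the window."""
    rng = random.Random(seed)
    frames = []
    while len(frames) < num_frames:
        context = frames[-(window - 1):] if window > 1 else []
        frames.append(sample_next(context, rng))
    return frames


def toy_sampler(context, rng):
    """Hypothetical stand-in for a trained model: a bounded random walk."""
    prev = context[-1] if context else 0
    return max(0, min(255, prev + rng.choice([-1, 0, 1])))


video = generate_long(toy_sampler, num_frames=1024)
print(len(video))  # 1024
```

Because each step sees only a fixed-length context, the sequence can grow to any length while the sampler never receives more history than it would during training on short clips.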

BibTeX

@article{ge2022long,
  title     = {Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer},
  author    = {Ge, Songwei and Hayes, Thomas and Yang, Harry and Yin, Xi and Pang, Guan and Jacobs, David and Huang, Jia-Bin and Parikh, Devi},
  journal   = {arXiv preprint arXiv:2204.03638},
  year      = {2022}
}