CV-VAE: A Compatible Video VAE for Latent Generative Video Models

Sijie Zhao

Yong Zhang*

Xiaodong Cun

Shaoshu Yang

Muyao Niu

Xiaoyu Li

Wenbo Hu

Ying Shan


Updates

[30 May 2024] Our preprint has been released on arXiv.

Preface

We propose CV-VAE, a video VAE whose latent space is compatible with existing image and video models trained with the SD image VAE. Rather than relying on uniform frame sampling, our video VAE provides a truly spatio-temporally compressed latent space for latent generative video models. Because the latent spaces are compatible, a new video model can be trained efficiently using pretrained image or video models as initialization. In addition, existing video models such as SVD can generate smoother videos with four times more frames using our video VAE, after slightly fine-tuning only a few parameters.
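To make the frame-count claim concrete, the following is a minimal sketch of the shape arithmetic only, not the CV-VAE implementation. The function names `latent_shape` and `decodable_frames` are hypothetical. The temporal factor of 4 is an assumption inferred from the "four times more frames" statement above, and the spatial factor of 8 and 4 latent channels are those of the SD image VAE. Boundary handling of the first or last frames is ignored.

```python
from typing import Tuple


def latent_shape(
    frames: int,
    height: int,
    width: int,
    temporal_factor: int = 4,  # assumption: inferred from "four times more frames"
    spatial_factor: int = 8,   # SD image VAE downsamples each spatial side by 8
    latent_channels: int = 4,  # SD image VAE uses 4 latent channels
) -> Tuple[int, int, int, int]:
    """Return (channels, latent_frames, latent_height, latent_width)."""
    if frames < 1:
        raise ValueError("frames must be positive")
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by spatial_factor")
    # Ceiling division: a partial group of frames still needs one latent frame.
    latent_frames = -(-frames // temporal_factor)
    return (
        latent_channels,
        latent_frames,
        height // spatial_factor,
        width // spatial_factor,
    )


def decodable_frames(latent_frames: int, temporal_factor: int = 4) -> int:
    """Frames a temporally compressed decoder yields from a latent sequence."""
    return latent_frames * temporal_factor


if __name__ == "__main__":
    # A 16-frame 256x256 clip under the assumed factors.
    print(latent_shape(16, 256, 256))  # (4, 4, 32, 32)
    # A model that emits 8 latent frames would decode to 32 video frames.
    print(decodable_frames(8))  # 32
```

Under these assumptions, a video model that keeps producing the same number of latent frames decodes to four times as many video frames once the image VAE decoder is swapped for a temporally compressed one. With a per-frame image VAE, by contrast, each latent frame decodes to exactly one video frame.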

CV-VAE Framework

Overview of CV-VAE.

Generation Results

Citation

@article{zhao2024cvvae,
  title={CV-VAE: A Compatible Video VAE for Latent Generative Video Models},
  author={Zhao, Sijie and Zhang, Yong and Cun, Xiaodong and Yang, Shaoshu and Niu, Muyao and Li, Xiaoyu and Hu, Wenbo and Shan, Ying},
  journal={https://arxiv.org/abs/2405.20279},
  year={2024}
}

The website template was borrowed from Open X-Embodiment.