xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Qin, Can; Xia, Congying; Ramakrishnan, Krithika; Ryoo, Michael; Tu, Lifu; Feng, Yihao; Shu, Manli; Zhou, Honglu; Awadalla, Anas; Wang, Jun; Purushwalkam, Senthil; Xue, Le; Zhou, Yingbo; Wang, Huan; Savarese, Silvio; Niebles, Juan Carlos; Chen, Zeyuan; Xu, Ran; Xiong, Caiming

arXiv:2408.12590 (cs)

[Submitted on 22 Aug 2024 (v1), last revised 31 Aug 2024 (this version, v2)]

Abstract: We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of the visual token sequence and the computational demands of generating long-sequence videos. To further reduce computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We built a data processing pipeline from scratch and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports end-to-end generation of 720p videos longer than 14 seconds and demonstrates competitive performance against state-of-the-art T2V models.
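
To give a rough sense of why temporal compression shortens the token sequence, the sketch below counts transformer tokens for one clip with and without temporal downsampling. Every number in it is an assumption made for illustration and does not come from the abstract: the 24 fps frame rate, the 8x spatial and 4x temporal factors, and the patch size of 2 are hypothetical, as is the helper `token_count`.

```python
import math

def token_count(frames, height, width, t_factor=1, s_factor=1, patch=1):
    """Count tokens for a video after hypothetical VAE compression.

    t_factor / s_factor: assumed temporal / spatial downsampling factors.
    patch: side length of the square patches the latent frames are split into.
    """
    latent_t = math.ceil(frames / t_factor)
    latent_h = math.ceil(height / s_factor)
    latent_w = math.ceil(width / s_factor)
    tokens_per_frame = math.ceil(latent_h / patch) * math.ceil(latent_w / patch)
    return latent_t * tokens_per_frame

# Illustrative clip: 14 s at an assumed 24 fps, 1280x720 frames.
frames, height, width = 14 * 24, 720, 1280

# Spatial compression only (assumed 8x), no temporal compression.
spatial_only = token_count(frames, height, width, t_factor=1, s_factor=8, patch=2)

# Spatial plus temporal compression (assumed 8x and 4x).
spatio_temporal = token_count(frames, height, width, t_factor=4, s_factor=8, patch=2)

# Ratio of the two sequence lengths.
reduction = spatial_only / spatio_temporal
print(spatial_only, spatio_temporal, reduction)
```

Under these assumed factors the sequence length shrinks in proportion to the temporal factor. Because self-attention cost grows faster than linearly with sequence length, the saving in compute would be larger than the saving in tokens.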

Comments:	Accepted by ECCV24 AI4VA
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2408.12590 [cs.CV]
	(or arXiv:2408.12590v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.12590

Submission history

From: Can Qin
[v1] Thu, 22 Aug 2024 17:55:22 UTC (32,753 KB)
[v2] Sat, 31 Aug 2024 05:12:09 UTC (32,750 KB)
