Abstract
Video outpainting is a challenging task that aims to generate video content outside the viewport of the input video while maintaining inter-frame and intra-frame consistency. Existing methods fall short in either generation quality or flexibility. We introduce MOTIA (Mastering Video Outpainting Through Input-Specific Adaptation), a diffusion-based pipeline that leverages both the intrinsic data-specific patterns of the source video and the image/video generative prior for effective outpainting. MOTIA comprises two main phases: input-specific adaptation and pattern-aware outpainting. The input-specific adaptation phase conducts efficient and effective pseudo outpainting learning on the single-shot source video. This process encourages the model to identify and learn patterns within the source video and bridges the gap between the standard generative process and outpainting. The subsequent phase, pattern-aware outpainting, generalizes these learned patterns to produce the outpainting results. Additional strategies are proposed to better leverage the diffusion model's generative prior and the video patterns acquired from the source video during inference. Extensive evaluations underscore MOTIA's superiority: it outperforms existing state-of-the-art methods on widely recognized benchmarks. Notably, these advancements are achieved without extensive task-specific tuning. More details are available at https://be-your-outpainter.github.io/.
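For readers who want a concrete picture of the two-phase recipe sketched in the abstract, the following minimal PyTorch example illustrates the idea on a toy scale. The `TinyVideoDenoiser`, the random-margin masking used for pseudo outpainting learning, the noise schedule, and the RePaint-style re-injection of the known region at inference are all illustrative assumptions made for exposition; they are not the authors' released implementation, which adapts a pretrained image/video diffusion model rather than training a small network from scratch.

```python
# Conceptual sketch of a two-phase diffusion outpainting loop in the spirit of
# MOTIA: (1) input-specific adaptation via pseudo outpainting on the source
# video, (2) mask-guided outpainting at inference. The toy denoiser, masking
# logic, and hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                   # diffusion steps
betas = torch.linspace(1e-4, 2e-2, T)      # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    """Forward diffusion: noise a clean clip x0 to timestep t."""
    a = alphas_bar[t].view(-1, 1, 1, 1, 1).sqrt()
    s = (1.0 - alphas_bar[t]).view(-1, 1, 1, 1, 1).sqrt()
    return a * x0 + s * noise

class TinyVideoDenoiser(nn.Module):
    """Stand-in for a video diffusion UNet; predicts the added noise."""
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(ch + 1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv3d(32, ch, 3, padding=1),
        )
    def forward(self, x, mask, t):
        # Mask is concatenated as an extra channel so the model knows
        # which region is observed and which must be generated.
        return self.net(torch.cat([x, mask], dim=1))

def adapt_on_source(model, video, steps=200, lr=1e-4):
    """Phase 1: pseudo outpainting learning on the single source video.
    Random border bands of the known clip are marked 'unknown', so the model
    learns to fill margins with the clip's own spatial/temporal patterns."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        mask = torch.ones_like(video[:, :1])          # 1 = known region
        m = torch.randint(4, 16, (1,)).item()         # random margin width
        mask[..., :m], mask[..., -m:] = 0, 0          # hide left/right bands
        t = torch.randint(0, T, (video.shape[0],))
        noise = torch.randn_like(video)
        pred = model(q_sample(video, t, noise), mask, t)
        loss = F.mse_loss(pred, noise)
        opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def outpaint(model, video, mask, ddim_steps=50):
    """Phase 2: denoise the enlarged canvas, re-injecting the known pixels at
    every step (a generic RePaint-style scheme, used here for illustration)."""
    x = torch.randn_like(video)
    stride = T // ddim_steps
    for i in reversed(range(0, T, stride)):
        t = torch.full((video.shape[0],), i, dtype=torch.long)
        eps = model(x, mask, t)
        a = alphas_bar[i]
        x0_hat = (x - (1 - a).sqrt() * eps) / a.sqrt()          # predicted clean clip
        t_prev = torch.clamp(t - stride, min=0)
        x = q_sample(x0_hat, t_prev, torch.randn_like(x))       # step to t_prev
        known = q_sample(video, t_prev, torch.randn_like(video))
        x = mask * known + (1 - mask) * x                       # keep source content
    return mask * video + (1 - mask) * x

# Toy usage: a 16-frame, 64x64 clip widened to a 96-pixel-wide canvas.
src = torch.rand(1, 3, 16, 64, 64)
canvas = F.pad(src, (16, 16))                          # pad width left/right
known = F.pad(torch.ones_like(src[:, :1]), (16, 16))   # 1 where source exists
model = TinyVideoDenoiser()
adapt_on_source(model, src, steps=10)                  # short run for the demo
result = outpaint(model, canvas, known)                # (1, 3, 16, 64, 96)
```

The key design point the sketch tries to convey is the division of labor: the adaptation loop only ever sees the source clip (with self-generated outpainting masks), while the inference loop relies on the adapted denoiser plus known-region re-injection to extend the canvas, so no large-scale task-specific training set is required.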
Acknowledgements
This project is funded in part by the National Key R&D Program of China Project 2022ZD0161100, by the Centre for Perceptual and Interactive Intelligence (CPII) Ltd under the Innovation and Technology Commission (ITC)'s InnoHK, by Smart Traffic Fund PSRI/76/2311/PR, and by RGC General Research Fund Project 14204021. Hongsheng Li is a PI of CPII under InnoHK.