
Be-Your-Outpainter: Mastering Video Outpainting Through Input-Specific Adaptation

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Video outpainting is a challenging task that aims to generate video content outside the viewport of the input video while maintaining inter-frame and intra-frame consistency. Existing methods fall short in either generation quality or flexibility. We introduce MOTIA (Mastering Video Outpainting Through Input-Specific Adaptation), a diffusion-based pipeline that leverages both the intrinsic data-specific patterns of the source video and the image/video generative prior for effective outpainting. MOTIA comprises two main phases: input-specific adaptation and pattern-aware outpainting. The input-specific adaptation phase conducts efficient and effective pseudo-outpainting learning on the single-shot source video. This process encourages the model to identify and learn patterns within the source video, and bridges the gap between standard generative processes and outpainting. The subsequent pattern-aware outpainting phase generalizes these learned patterns to produce the outpainting results. Additional strategies are proposed to better leverage the diffusion model's generative prior and the video patterns acquired from the source video during inference. Extensive evaluations underscore MOTIA's superiority: it outperforms existing state-of-the-art methods on widely recognized benchmarks. Notably, these advancements are achieved without extensive task-specific tuning. More details are available at https://be-your-outpainter.github.io/.
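
The abstract describes MOTIA only at the pipeline level. The sketch below is a minimal structural illustration of the two phases as stated there: pseudo-outpainting adaptation on the single source video, then outpainting onto a larger canvas. It is not the authors' implementation; the single-pass "denoiser" and every name in it (DummyDenoiser, random_crop_mask, input_specific_adaptation, pattern_aware_outpainting) are hypothetical placeholders for illustration only.

# Structural sketch of a two-phase outpainting pipeline (hypothetical names).
import torch
import torch.nn.functional as F

class DummyDenoiser(torch.nn.Module):
    """Stand-in for a pretrained video diffusion denoiser (hypothetical)."""
    def __init__(self, channels=3):
        super().__init__()
        # Condition on the outpainting mask via channel concatenation.
        self.net = torch.nn.Conv3d(channels + 1, channels, kernel_size=3, padding=1)

    def forward(self, noisy_video, mask):
        return self.net(torch.cat([noisy_video, mask], dim=1))

def random_crop_mask(video, keep_ratio=0.6):
    """Mark a centered region as 'known', so the border mimics unknown content."""
    _, _, _, h, w = video.shape
    mask = torch.zeros_like(video[:, :1])
    kh, kw = int(h * keep_ratio), int(w * keep_ratio)
    top, left = (h - kh) // 2, (w - kw) // 2
    mask[..., top:top + kh, left:left + kw] = 1.0
    return mask

def input_specific_adaptation(model, source_video, steps=100, lr=1e-4):
    """Phase 1: pseudo-outpainting learning on the single source video."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        mask = random_crop_mask(source_video)          # pretend the border is unknown
        noise = torch.randn_like(source_video)
        noisy = source_video * mask + noise * (1 - mask)
        pred = model(noisy, mask)
        # Supervise only the masked (pseudo-unknown) border region.
        loss = F.mse_loss(pred * (1 - mask), source_video * (1 - mask))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

def pattern_aware_outpainting(model, source_video, out_h, out_w):
    """Phase 2: place the source video on a larger canvas and fill the border."""
    b, c, t, h, w = source_video.shape
    canvas = torch.randn(b, c, t, out_h, out_w)
    mask = torch.zeros(b, 1, t, out_h, out_w)
    top, left = (out_h - h) // 2, (out_w - w) // 2
    canvas[..., top:top + h, left:left + w] = source_video
    mask[..., top:top + h, left:left + w] = 1.0
    with torch.no_grad():
        filled = model(canvas, mask)
    # Keep known pixels, take generated content everywhere else.
    return canvas * mask + filled * (1 - mask)

if __name__ == "__main__":
    video = torch.rand(1, 3, 8, 64, 64)                # toy source clip
    model = input_specific_adaptation(DummyDenoiser(), video, steps=5)
    result = pattern_aware_outpainting(model, video, out_h=64, out_w=96)
    print(result.shape)                                # torch.Size([1, 3, 8, 64, 96])

In the actual method the adaptation would fine-tune a pretrained image/video diffusion model with a proper noise schedule and iterative sampling; the sketch keeps only the two-phase structure and the known/unknown masking logic described in the abstract.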



Acknowledgements

This project is funded in part by National Key R&D Program of China Project 2022ZD0161100, by the Centre for Perceptual and Interactive Intelligence (CPII) Ltd under the Innovation and Technology Commission (ITC)'s InnoHK, by Smart Traffic Fund PSRI/76/2311/PR, and by RGC General Research Fund Project 14204021. Hongsheng Li is a PI of CPII under InnoHK.

Author information

Corresponding author

Correspondence to Fu-Yun Wang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 28145 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, FY. et al. (2025). Be-Your-Outpainter: Mastering Video Outpainting Through Input-Specific Adaptation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15102. Springer, Cham. https://doi.org/10.1007/978-3-031-72784-9_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72784-9_9

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72783-2

  • Online ISBN: 978-3-031-72784-9

  • eBook Packages: Computer Science, Computer Science (R0)
