Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Guo, Lanqing; He, Yingqing; Chen, Haoxin; Xia, Menghan; Cun, Xiaodong; Wang, Yufei; Huang, Siyu; Zhang, Yong; Wang, Xintao; Chen, Qifeng; Shan, Ying; Wen, Bihan

doi:10.1007/978-3-031-72764-1_3

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15094))

Included in the following conference series:

European Conference on Computer Vision

52 Accesses

Abstract

Diffusion models have proven to be highly effective in image and video generation; however, they encounter challenges in the correct composition of objects when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models to higher resolution demands substantial computational and optimization resources, yet achieving generation capabilities comparable to low-resolution models remains challenging. This paper proposes a novel self-cascade diffusion model that leverages the knowledge gained from a well-trained low-resolution image/video generation model, enabling rapid adaptation to higher-resolution generation. Building on this, we employ the pivot replacement strategy to facilitate a tuning-free version by progressively leveraging reliable semantic guidance derived from the low-resolution model. We further propose to integrate a sequence of learnable multi-scale upsampler modules for a tuning version capable of efficiently learning structural details at a new scale from a small amount of newly acquired high-resolution training data. Compared to full fine-tuning, our approach achieves a $5{\times }$ training speed-up and requires only 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.

L. Guo and Y. He—Equal Contributions.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 64.99; Price excludes VAT (USA)

Softcover Book: USD 79.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Any-Resolution Training for High-Resolution Image Synthesis

FMBoost: Boosting Latent Diffusion with Flow Matching

Exploiting Diffusion Prior for Real-World Image Super-Resolution

Article 11 July 2024

References

Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: a joint video and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728–1738 (2021)
Google Scholar
Blattmann, A., et al.: Align your latents: high-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575 (2023)
Google Scholar
Bond-Taylor, S., Willcocks, C.G.: $\infty $-diff: infinite resolution diffusion with subsampled mollified states. arXiv preprint arXiv:2303.18242 (2023)
Chai, L., Gharbi, M., Shechtman, E., Isola, P., Zhang, R.: Any-resolution training for high-resolution image synthesis. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13676, pp. 170–188. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19787-1_10
Chapter Google Scholar
Chen, T.: On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972 (2023)
Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Adv. Neural. Inf. Process. Syst. 34, 8780–8794 (2021)
Google Scholar
Diffusion, S.: Stable diffusion 2-1 base (2022). https://huggingface.co/stabilityai/stable-diffusion-2-1-base/blob/main/v2-1_512-ema-pruned.ckpt
Gu, J., Zhai, S., Zhang, Y., Susskind, J., Jaitly, N.: Matryoshka diffusion models. arXiv preprint arXiv:2310.15111 (2023)
Gu, S., et al.: Vector quantized diffusion model for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10696–10706 (2022)
Google Scholar
He, Y., et al.: ScaleCrafter: tuning-free higher-resolution visual generation with diffusion models. arXiv preprint arXiv:2310.07702 (2023)
He, Y., Yang, T., Zhang, Y., Shan, Y., Chen, Q.: Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221 (2022)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Adv. Neural. Inf. Process. Syst. 33, 6840–6851 (2020)
Google Scholar
Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. 23(1), 2249–2281 (2022)
MathSciNet Google Scholar
Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. 23, 47–1 (2022)
MathSciNet Google Scholar
Hoogeboom, E., Heek, J., Salimans, T.: Simple diffusion: end-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093 (2023)
Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
Jin, Z., Shen, X., Li, B., Xue, X.: Training-free diffusion model adaptation for variable-sized text-to-image synthesis. arXiv preprint arXiv:2306.08645 (2023)
Parmar, G., Zhang, R., Zhu, J.Y.: On aliased resizing and surprising subtleties in GAN evaluation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11410–11420 (2022)
Google Scholar
Podell, D., et al.: SDXL: improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Google Scholar
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
Google Scholar
Saharia, C., et al.: Palette: image-to-image diffusion models. In: ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–10 (2022)
Google Scholar
Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models (2022)
Google Scholar
Si, C., Huang, Z., Jiang, Y., Liu, Z.: FreeU: free lunch in diffusion U-Net. arXiv preprint arXiv:2309.11497 (2023)
Singer, U., et al.: Make-a-video: text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
Su, X., Song, J., Meng, C., Ermon, S.: Dual diffusion implicit bridges for image-to-image translation. arXiv preprint arXiv:2203.08382 (2022)
Teng, J., et al.: Relay diffusion: unifying diffusion process across resolutions for image synthesis. arXiv preprint arXiv:2309.03350 (2023)
Unterthiner, T., van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: a new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)
Unterthiner, T., van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: a new metric & challenges. ICLR (2019)
Google Scholar
Wang, Y., et al.: LAVIE: high-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103 (2023)
Wu, J.Z., et al.: Tune-a-video: one-shot tuning of image diffusion models for text-to-video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7623–7633 (2023)
Google Scholar
Xie, E., et al.: DiffFit: unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. arXiv preprint arXiv:2304.06648 (2023)
Yu, S., Sohn, K., Kim, S., Shin, J.: Video probabilistic diffusion models in projected latent space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18456–18466 (2023)
Google Scholar
Zhang, D.J., et al.: Show-1: marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818 (2023)
Zheng, Q., et al.: Any-size-diffusion: toward efficient text-driven synthesis for any-size HD images. arXiv preprint arXiv:2308.16582 (2023)

Download references

Acknowledgements

This research was carried out at the Rapid-Rich Object Search (ROSE) Lab at Nanyang Technological University in Singapore. This work was supported in part by the National Research Foundation Singapore Competitive Research Program (award number CRP29-2022-0003), and in part by the National Key R&D Program of China under grant number 2022ZD0161501.

Author information

Authors and Affiliations

Nanyang Technological University, Singapore, Singapore
Lanqing Guo, Yufei Wang & Bihan Wen
Tencent AI Lab, Shenzhen, China
Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yong Zhang, Xintao Wang & Ying Shan
The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
Yingqing He & Qifeng Chen
Clemson University, Clemson, USA
Siyu Huang

Authors

Lanqing Guo
View author publications
You can also search for this author in PubMed Google Scholar
Yingqing He
View author publications
You can also search for this author in PubMed Google Scholar
Haoxin Chen
View author publications
You can also search for this author in PubMed Google Scholar
Menghan Xia
View author publications
You can also search for this author in PubMed Google Scholar
Xiaodong Cun
View author publications
You can also search for this author in PubMed Google Scholar
Yufei Wang
View author publications
You can also search for this author in PubMed Google Scholar
Siyu Huang
View author publications
You can also search for this author in PubMed Google Scholar
Yong Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Xintao Wang
View author publications
You can also search for this author in PubMed Google Scholar
Qifeng Chen
View author publications
You can also search for this author in PubMed Google Scholar
Ying Shan
View author publications
You can also search for this author in PubMed Google Scholar
Bihan Wen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Bihan Wen .

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 12831 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Guo, L. et al. (2025). Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15094. Springer, Cham. https://doi.org/10.1007/978-3-031-72764-1_3

Download citation

DOI: https://doi.org/10.1007/978-3-031-72764-1_3
Published: 25 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72763-4
Online ISBN: 978-3-031-72764-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Any-Resolution Training for High-Resolution Image Synthesis

FMBoost: Boosting Latent Diffusion with Flow Matching

Exploiting Diffusion Prior for Real-World Image Super-Resolution

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 12831 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Any-Resolution Training for High-Resolution Image Synthesis

FMBoost: Boosting Latent Diffusion with Flow Matching

Exploiting Diffusion Prior for Real-World Image Super-Resolution

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 12831 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation