DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

Jia, Yuru; Hoyer, Lukas; Huang, Shengyu; Wang, Tianfu; Van Gool, Luc; Schindler, Konrad; Obukhov, Anton

doi:10.1007/978-3-031-72933-1_6

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15097)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content, specialize to user data through few-shot fine-tuning, and condition their output on other modalities, such as semantic maps. However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation? We investigate this question in the context of autonomous driving, and answer it with a resounding “yes”. We propose an efficient data generation pipeline termed DGInStyle. First, we examine the problem of specializing a pretrained LDM to semantically-controlled generation within a narrow domain. Second, we propose a Style Swap technique to endow the rich generative prior with the learned semantic control. Third, we design a Multi-resolution Latent Fusion technique to overcome the bias of LDMs towards dominant objects. Using DGInStyle, we generate a diverse dataset of street scenes, train a domain-agnostic semantic segmentation model on it, and evaluate the model on multiple popular autonomous driving datasets. Our approach consistently increases the performance of several domain generalization methods compared to the previous state-of-the-art methods. The source code and the generated dataset are available at dginstyle.github.io.

Author information

Authors and Affiliations

ETH Zürich, Zürich, Switzerland
Yuru Jia, Lukas Hoyer, Shengyu Huang, Tianfu Wang, Luc Van Gool, Konrad Schindler & Anton Obukhov
KU Leuven, Leuven, Belgium
Yuru Jia & Luc Van Gool
INSAIT Sofia, Sofia, Bulgaria
Luc Van Gool

Corresponding author

Correspondence to Anton Obukhov.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Hessen, Germany
Stefan Roth
Princeton University, Palo Alto, CA, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 34729 KB)

About this paper

Cite this paper

Jia, Y. et al. (2025). DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15097. Springer, Cham. https://doi.org/10.1007/978-3-031-72933-1_6

DOI: https://doi.org/10.1007/978-3-031-72933-1_6
Published: 03 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72932-4
Online ISBN: 978-3-031-72933-1
eBook Packages: Computer Science (R0)
