Abstract
Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals—resulting in uncontrolled ambient sounds or hallucinations at test time. We propose AV-LDM, a novel ambient-aware audio generation model with an audio-conditioning mechanism that learns to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets, Ego4D and EPIC-KITCHENS, and we introduce Ego4D-Sounds—1.2M curated clips with action-audio correspondence. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our approach is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.
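To make the mechanism above concrete, here is a minimal sketch of retrieval-augmented ambient conditioning: a toy denoiser is conditioned jointly on video features and on an ambient-audio embedding retrieved by visual similarity from a clip bank. Every name, dimension, and the simplified noising step below is an illustrative assumption for exposition, not the authors' actual AV-LDM architecture or training recipe.

```python
# Hypothetical sketch only: a toy "ambient-aware" denoiser conditioned on video
# features plus a retrieved ambient-audio embedding. Shapes, modules, and the
# noising step are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AmbientAwareDenoiser(nn.Module):
    """Predicts the noise added to an audio latent, given video + ambient cues."""
    def __init__(self, latent_dim=64, video_dim=512, ambient_dim=128):
        super().__init__()
        self.cond_proj = nn.Linear(video_dim + ambient_dim, latent_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim * 2 + 1, 256), nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, noisy_latent, t, video_feat, ambient_feat):
        # Fuse the visual and ambient conditions, then predict the noise.
        cond = self.cond_proj(torch.cat([video_feat, ambient_feat], dim=-1))
        return self.net(torch.cat([noisy_latent, cond, t[:, None]], dim=-1))

def retrieve_ambient(query_video_feat, bank_video_feats, bank_audio_feats):
    """Retrieval-augmented conditioning: return the audio embedding whose paired
    video embedding is most similar (cosine) to the query video embedding."""
    sims = F.cosine_similarity(query_video_feat[None, :], bank_video_feats, dim=-1)
    return bank_audio_feats[sims.argmax()]

# Usage sketch: one training step with retrieved ambient conditions (random data).
B, latent_dim, video_dim, ambient_dim = 4, 64, 512, 128
model = AmbientAwareDenoiser(latent_dim, video_dim, ambient_dim)
clean = torch.randn(B, latent_dim)            # placeholder audio latents
video = torch.randn(B, video_dim)             # placeholder video features
bank_v, bank_a = torch.randn(100, video_dim), torch.randn(100, ambient_dim)
ambient = torch.stack([retrieve_ambient(v, bank_v, bank_a) for v in video])
t = torch.rand(B)                             # diffusion "time" in [0, 1)
noise = torch.randn_like(clean)
noisy = torch.sqrt(1 - t)[:, None] * clean + torch.sqrt(t)[:, None] * noise
loss = F.mse_loss(model(noisy, t, video, ambient), noise)
loss.backward()
```

Under this reading, the controllable ambient generation mentioned in the abstract would correspond to choosing what is retrieved at test time (e.g., a near-silent clip to suppress background sound); this is an interpretation of the abstract, not a description of the released system.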
C. Chen and P. Peng contributed equally.
Acknowledgments
UT Austin is supported in part by the IFML NSF AI Institute. Wei-Ning Hsu served in an advisory role only; all work and data processing were done outside of Meta.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, C. et al. (2025). Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15128. Springer, Cham. https://doi.org/10.1007/978-3-031-72897-6_16
DOI: https://doi.org/10.1007/978-3-031-72897-6_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72896-9
Online ISBN: 978-3-031-72897-6
eBook Packages: Computer Science, Computer Science (R0)