Abstract
Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals—resulting in uncontrolled ambient sounds or hallucinations at test time. We propose AV-LDM, a novel ambient-aware audio generation model with an audio-conditioning mechanism that learns to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets, Ego4D and EPIC-KITCHENS, and we introduce Ego4D-Sounds—1.2M curated clips with action-audio correspondence. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our approach is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.
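To make the mechanism above concrete, here is a minimal sketch of retrieval-augmented ambient conditioning: a toy denoiser is conditioned jointly on video features and on an ambient-audio embedding retrieved by visual similarity from a clip bank. Every name, dimension, and the simplified noising step below is an illustrative assumption for exposition, not the authors' actual AV-LDM architecture or training recipe.

```python
# Hypothetical sketch only: a toy "ambient-aware" denoiser conditioned on video
# features plus a retrieved ambient-audio embedding. Shapes, modules, and the
# noising step are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AmbientAwareDenoiser(nn.Module):
    """Predicts the noise added to an audio latent, given video + ambient cues."""
    def __init__(self, latent_dim=64, video_dim=512, ambient_dim=128):
        super().__init__()
        self.cond_proj = nn.Linear(video_dim + ambient_dim, latent_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim * 2 + 1, 256), nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, noisy_latent, t, video_feat, ambient_feat):
        # Fuse the visual and ambient conditions, then predict the noise.
        cond = self.cond_proj(torch.cat([video_feat, ambient_feat], dim=-1))
        return self.net(torch.cat([noisy_latent, cond, t[:, None]], dim=-1))

def retrieve_ambient(query_video_feat, bank_video_feats, bank_audio_feats):
    """Retrieval-augmented conditioning: return the audio embedding whose paired
    video embedding is most similar (cosine) to the query video embedding."""
    sims = F.cosine_similarity(query_video_feat[None, :], bank_video_feats, dim=-1)
    return bank_audio_feats[sims.argmax()]

# Usage sketch: one training step with retrieved ambient conditions (random data).
B, latent_dim, video_dim, ambient_dim = 4, 64, 512, 128
model = AmbientAwareDenoiser(latent_dim, video_dim, ambient_dim)
clean = torch.randn(B, latent_dim)            # placeholder audio latents
video = torch.randn(B, video_dim)             # placeholder video features
bank_v, bank_a = torch.randn(100, video_dim), torch.randn(100, ambient_dim)
ambient = torch.stack([retrieve_ambient(v, bank_v, bank_a) for v in video])
t = torch.rand(B)                             # diffusion "time" in [0, 1)
noise = torch.randn_like(clean)
noisy = torch.sqrt(1 - t)[:, None] * clean + torch.sqrt(t)[:, None] * noise
loss = F.mse_loss(model(noisy, t, video, ambient), noise)
loss.backward()
```

Under this reading, the controllable ambient generation mentioned in the abstract would correspond to choosing what is retrieved at test time (e.g., a near-silent clip to suppress background sound); this is an interpretation of the abstract, not a description of the released system.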
C. Chen and P. Peng contributed equally.
Acknowledgments
UT Austin is supported in part by the IFML NSF AI Institute. Wei-Ning Hsu served in an advisory role only; all work and data processing were done outside of Meta.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, C. et al. (2025). Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15128. Springer, Cham. https://doi.org/10.1007/978-3-031-72897-6_16
DOI: https://doi.org/10.1007/978-3-031-72897-6_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72896-9
Online ISBN: 978-3-031-72897-6
eBook Packages: Computer Science, Computer Science (R0)