Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

  • Conference paper
  • First Online:

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15128)

Abstract

Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals—resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets, Ego4D and EPIC-KITCHENS, and we introduce Ego4D-Sounds—1.2M curated clips with action-audio correspondence. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our approach is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.

C. Chen and P. Peng contributed equally to this work.
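
For intuition only, the sketch below illustrates the kind of mechanism the abstract describes: a diffusion-style denoiser over audio latents that is conditioned jointly on clip-level video features and on an ambient-audio embedding retrieved from a bank of training clips. It is not the authors' AV-LDM; every module name, tensor dimension, and the nearest-neighbour retrieval step are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' AV-LDM): a denoiser over audio
# latents conditioned on video features and a retrieved ambient embedding.
# All module names, dimensions, and the retrieval scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AmbientAwareDenoiser(nn.Module):
    def __init__(self, latent_dim=64, video_dim=512, ambient_dim=128):
        super().__init__()
        # Project each conditioning signal to the latent width.
        self.video_proj = nn.Linear(video_dim, latent_dim)
        self.ambient_proj = nn.Linear(ambient_dim, latent_dim)
        # Small MLP standing in for the U-Net of a latent diffusion model.
        self.net = nn.Sequential(
            nn.Linear(latent_dim * 3 + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, noisy_latent, t, video_feat, ambient_feat):
        # Predict the noise (epsilon) from the noisy audio latent, the
        # diffusion timestep, and both conditioning signals.
        cond = torch.cat(
            [
                noisy_latent,
                t.float().unsqueeze(-1),
                self.video_proj(video_feat),
                self.ambient_proj(ambient_feat),
            ],
            dim=-1,
        )
        return self.net(cond)


def retrieve_ambient(query_video_feat, bank_video_feats, bank_ambient_embs):
    # Nearest-neighbour retrieval: return the ambient-audio embedding whose
    # clip has the most similar video features (cosine similarity).
    q = F.normalize(query_video_feat, dim=-1)
    b = F.normalize(bank_video_feats, dim=-1)
    idx = (q @ b.T).argmax(dim=-1)
    return bank_ambient_embs[idx]


# Toy usage with random tensors (batch of 4 silent clips, bank of 100 clips).
denoiser = AmbientAwareDenoiser()
video_feat = torch.randn(4, 512)
bank_video = torch.randn(100, 512)
bank_ambient = torch.randn(100, 128)
ambient = retrieve_ambient(video_feat, bank_video, bank_ambient)

noisy_latent = torch.randn(4, 64)
t = torch.randint(0, 1000, (4,))
eps_pred = denoiser(noisy_latent, t, video_feat, ambient)
print(eps_pred.shape)  # torch.Size([4, 64])
```

Under these assumptions, zeroing out or swapping the retrieved ambient embedding at sampling time would be one way to control how much background sound appears in the generated audio; the actual conditioning and retrieval design is specified in the paper itself.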

Notes

  1. https://ego4dsounds.github.io.

  2. https://github.com/gudgud96/frechet-audio-distance (see the usage sketch after this list).

  3. https://github.com/LAION-AI/CLAP.

  4. https://github.com/timsainb/noisereduce.
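
Footnote 2 above points to the frechet-audio-distance toolkit commonly used to compute the FAD metric. The snippet below is a hedged usage sketch, assuming the package's documented FrechetAudioDistance(...).score(background_dir, eval_dir) interface; the directory paths are placeholders.

```python
# Usage sketch (assumption: `pip install frechet_audio_distance` provides a
# FrechetAudioDistance class with a score(background_dir, eval_dir) method).
from frechet_audio_distance import FrechetAudioDistance

fad = FrechetAudioDistance(
    model_name="vggish",   # VGGish embeddings, as in the original FAD metric
    use_pca=False,
    use_activation=False,
    verbose=False,
)

# Each directory holds .wav files; both paths are placeholders.
score = fad.score("reference_clips/", "generated_clips/")
print(f"FAD: {score:.3f}")  # lower means generated audio is closer to the reference set
```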

Acknowledgments

UT Austin is supported in part by the IFML NSF AI Institute. Wei-Ning Hsu served in an advisory role only; all work and data processing were done outside of Meta.

Author information

Corresponding author

Correspondence to Changan Chen.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 2 (pdf 702 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, C. et al. (2025). Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15128. Springer, Cham. https://doi.org/10.1007/978-3-031-72897-6_16

  • DOI: https://doi.org/10.1007/978-3-031-72897-6_16

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72896-9

  • Online ISBN: 978-3-031-72897-6

  • eBook Packages: Computer Science, Computer Science (R0)
